A Parallel Heterogeneous Approach to Perturbative Monte Carlo QM/MM Simulations
Sebastião Salvador de Miranda
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Dr. Pedro Filipe Zeferino Tomás,
Dr. Nuno Filipe Valentim Roma
Examination Committee
Chairperson: Dr. Nuno Cavaco Gomes Horta
Supervisor: Dr. Pedro Filipe Zeferino Tomás
Members of the Committee: Dr. Gabriel Falcão Paiva Fernandes
October 2014
Acknowledgments
Foremost, I would like to thank my supervisors, Doctor Pedro Tomás, Doctor Nuno Roma and Doctor Frederico Pratas, who have provided me with invaluable guidance. I would also like to thank Doctor Gabriel Falcão, who reviewed the intermediate report of this dissertation and provided several insightful comments.
I would like to express my gratitude to Doctor Ricardo Mata, who enlightened me on several occasions about computational chemistry aspects, and invited me to spend a very pleasant month of research at the Free Floater Research Group, Institut für Physikalische Chemie, Georg-August-Universität Göttingen, Germany. Furthermore, I would like to thank Jonas Feldt, who helped me achieve a greater understanding of the PMC QM/MM simulation method, and with whom I have intensively collaborated in writing research articles and developing new simulation features. I would also like to thank my colleagues Tomás Ferreirinha, David Nogueira, Francisco Gaspar, Andriy Gorobets and João Silva, with whom I have discussed a multitude of topics, doubts and ideas during the development of my dissertation. Furthermore, I would like to thank João Guerreiro and Luís Tanica for having helped me in the development of power and energy measurement techniques.
Special thanks to my girlfriend Mafalda Coelho, who has endured several months of listening to dry
technical details about my dissertation. I would also like to thank my father Pedro Miranda and my mother
Ana Salvador, for having discussed with me several topics on matters of biology, chemistry, physics and
computation.
Finally, I would like to express my gratitude to INESC-ID and the Institut für Physikalische Chemie for having given me access to their infrastructure, namely their high performance computing platforms. Furthermore, the work presented herein was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under projects Threads (ref. PTDC/EEA-ELC/117329/2010) and P2HCS (ref. PTDC/EEI-ELC/3152/2012).
ABSTRACT
Molecular simulations play an increasingly important role in computational chemistry, computational biology and computer-aided drug design. However, traditional single-core implementations hardly satisfy current needs, due to the prolonged runs that result from not exploiting the intrinsic data and task parallelism of some of these methods. To address this limitation, a new heterogeneous parallel solution for Monte Carlo (MC) molecular simulations is herein introduced, exploiting fine-grained parallelism in the inner structure of the bottleneck procedures and coarse-grained parallelism in the MC state-space sampling. Unlike typical high-performance parallelization approaches for pure Quantum Mechanics (QM) or Molecular Mechanics (MM) methods, the work herein presented focuses on accelerating a novel Perturbative Monte Carlo (PMC) mixed QM/MM application. The hybrid nature of the proposed parallel approach warrants an efficient use of heterogeneous systems, composed of single or multiple CPUs and heterogeneous accelerators (e.g., GPUs), by relying on the multi-platform OpenCL programming framework. To efficiently exploit the parallel architecture, load balancing schemes were employed to schedule the work among the available accelerators. A speed-up of 56× is achieved in the computational bottleneck for the case of a relevant chorismate dataset, when compared with an optimized single-core implementation. A speed-up of 38× is observed for the full simulation, using both multi-core CPUs and GPUs, thus effectively reducing the execution time of the full simulation from ∼80 hours to ∼2 hours.
Keywords
Quantum Mechanics (QM), Molecular Mechanics (MM), Monte Carlo (MC) Simulations, Parallel
Computing, Heterogeneous Computing, OpenCL.
RESUMO
As simulações moleculares desempenham um papel cada vez mais importante na química e biologia computacionais e no desenvolvimento de fármacos assistido por computador. No entanto, as implementações tradicionais single-core têm execuções muito prolongadas, não aproveitando o paralelismo de dados e de tarefas intrinsecamente presente nalguns destes métodos. De forma a colmatar esta limitação, este trabalho introduz uma solução paralela e heterogénea para simulações moleculares Monte Carlo (MC), explorando o paralelismo fine-grained na estrutura interna do bottleneck computacional e o paralelismo coarse-grained na amostragem do espaço de estados de MC. Ao contrário de abordagens típicas de alta performance a algoritmos puros de Quantum Mechanics (QM) ou Molecular Mechanics (MM), este trabalho concentra-se na aceleração de um novo método Perturbative Monte Carlo (PMC) mixed QM/MM. A natureza híbrida da abordagem paralela proposta permite o uso de arquiteturas heterogéneas, compostas por um ou vários CPUs e aceleradores heterogéneos (e.g. GPUs), tirando partido da biblioteca multi-plataforma OpenCL. De forma a explorar eficazmente arquiteturas heterogéneas, foram aplicados esquemas de load balancing para distribuir a carga computacional pelos aceleradores disponíveis. É atingido um speed-up de 56× no bottleneck computacional para o caso de um chorismate dataset relevante na área, quando comparado com uma implementação single-core otimizada. No caso da simulação completa, é observado um speed-up de 38×, tirando partido de multi-core CPUs e GPUs. O tempo total desta simulação foi assim reduzido de ∼80 horas para ∼2 horas.
Palavras-Chave
Mecânica Quântica, Mecânica Molecular, Simulações Monte Carlo, Computação Paralela, Computação Heterogénea, OpenCL
CONTENTS
1 Introduction
  1.1 Objectives
  1.2 Main Contribution
  1.3 Document Outline
2 Heterogeneous Computing
  2.1 Multi-Core General-Purpose Processors (GPP) Architecture
  2.2 Graphical Processing Unit (GPU)
    2.2.1 AMD and Nvidia Architectures
  2.3 OpenCL
    2.3.1 Platform Model
    2.3.2 Execution Model
    2.3.3 Memory Model
    2.3.4 Programming Model
    2.3.5 OpenCL Runtime Parametrization
  2.4 Load Balancing Techniques
  2.5 Summary
3 Perturbative Monte Carlo QM/MM
  3.1 Algorithm Description
  3.2 Computational Complexity Analysis
  3.3 Data Dependencies
  3.4 Related Work
  3.5 Summary
4 Parallel Heterogeneous Solution
  4.1 Original PMC QM/MM
  4.2 Exploiting Markov Chain Parallelism
    4.2.1 Multiple Markov Chain Parallelism
  4.3 Parallelization Strategy
    4.3.1 OpenCL Host-Side Management
      4.3.1.A Load Balancing Among Multiple Markov Chains
  4.4 Data Structure Optimizations
    4.4.1 Indexing Molecules and Atoms
    4.4.2 Computing Distances
  4.5 Summary
5 Fine-Grained Parallelism and Multi-Device Load Balancing
  5.1 PMC Cycle Parallelization
    5.1.1 Monte Carlo
    5.1.2 Coulomb Grid QM/MM
    5.1.3 Coulomb/VDW MM
    5.1.4 Coulomb Nuclei/VDW QM/MM
    5.1.5 Decision Step
  5.2 Exploiting Single Markov Chain Parallelism
    5.2.1 Multiple OpenCL Devices
    5.2.2 Dynamic Load Balancing
      5.2.2.A Problem Partitioning Approaches
  5.3 Summary
6 Experimental Evaluation
  6.1 Benchmarking Setup
    6.1.1 Chemical Datasets
    6.1.2 Hardware Platforms
    6.1.3 Performance Baseline
  6.2 PMC Cycle Acceleration
    6.2.1 PMC Cycle Load Balancing
    6.2.2 PMC Cycle Scalability
  6.3 Global PMC Results
  6.4 Numerical Evaluation: Convergence Accuracy and Energy Consumption
  6.5 Summary
7 Conclusions
  7.1 Future Work
LIST OF FIGURES
2.1 Example CPU with 4 cores and 3 levels of cache.
2.2 Organization of the AMD Southern Islands GPU architecture.
2.3 GTX 680 device architecture.
2.4 OpenCL Platform Model [22].
2.5 Partitioning of work-items into work-groups.
2.6 Partitioning of work-items into work-groups.
2.7 Example of a heterogeneous network composed of several compute nodes, each comprised of multiple Central Processing Unit (CPU) cores and one or more specialized accelerators.
3.1 A system composed of one QM molecule (C) and two MM solvent molecules (A and B). For each MC step, the difference in energy between the molecule moved (A) and every other molecule has to be computed, but at different levels of theory.
3.2 Perturbative Monte Carlo QM/MM with focus on the simulation bottleneck (PMC Cycle, right). Arrows represent data dependencies.
3.3 Main data structures used in the PMC Cycle. Refer to Table 3.1 for parameter definitions.
3.4 Data dependencies within the PMC Cycle. The VDW QMMM and Coulomb Nuclei QMMM processes only read the atoms that are part of the QM molecule, not the whole lattice.
4.1 Independent MC state-space exploration chains (illustrative example for 2 chains), each generating an independent sampling of the conformal space of the target QM/MM system.
4.2 MC state-space alongside the execution timeline for three Markov chains.
4.3 Simultaneous exploitation of chain-level, task-level and data-level parallelism in the PMC QM/MM method.
4.4 Multi-process/multi-threading structure of the designed parallel solution for the PMC method (right), alongside the original dual-process approach (left).
4.5 Program flow of the devised parallel PMC program, for the case of a single-device single-process instance (in order to keep the illustration clear). The legend for the numbered parts of this figure is presented throughout the text.
4.6 mol2atom data structure, together with the lattice vectors. The mol2atom structure returns the index of the first atom belonging to the target molecule, which can then be used to index the lattice vectors, which contain the {x, y, z, σ, ε, q} data.
4.7 Original approach to distance computation (left), together with the devised on-the-fly solution (right). For the sake of clarity, the distance computation procedures were singled out, although they are executed in the same computation loop as the Coulomb/VDW procedures. The remaining procedures of the PMC Cycle step have been omitted, also for the sake of clarity.
5.1 Mapping of the PMC Cycle procedures into OpenCL kernels. It should be noticed that some procedures were merged into the same kernel. Furthermore, the OpenCL version requires additional kernels for the parallel reductions (mm_finish and q3m_finish, marked with a ∗).
5.2 Memory layout example for the main data structures used in the PMC Cycle.
5.3 Diagram of the devised monte_carlo kernel, together with the layout of the data manipulated in this procedure.
5.4 Scheme used for partitioning the grid among the work-groups, in order to allow a coalesced memory access pattern. For the sake of keeping the illustration clear, an example for P = 2 and wgsize = 4 is shown.
5.5 q3m_c and q3m_finish kernel structure. In this example, work-group 0 is presented with additional detail, although all work-groups share an identical structure. Likewise, the 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable. Furthermore, additional details concerning the first global memory accesses (label 1) are depicted in Figure 5.4.
5.6 q3m_vdwc kernel structure. An 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable.
5.7 decide_update kernel diagram. An 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable.
5.8 Exploiting multiple heterogeneous OpenCL devices to execute the PMC Cycle. The execution is balanced by executing different kernels on each device and dividing the work of the heavier kernels (q3m_c and q3m_reduce).
5.9 Work-flow of the centralized predicting-the-future dynamic load balancing solution employed in this dissertation.
6.1 Time footprint of a single PMC Cycle step for the bench-A dataset running on the avx2-baseline.
6.2 One complete PMC outer iteration, comprised of 10k PMC Cycle steps and a QM Update, for the bench-A dataset running on the avx2-baseline. The bottleneck of each PMC iteration is the PMC Cycle.
6.3 Speed-up obtained for a PMC Cycle with 10k iterations, when using fp64-fp32 mixed precision. The corresponding execution times are presented in Table 6.3.
6.4 OpenCL kernel timings (per step) for the PMC Cycle running on the mcx2 heterogeneous platform. The load is balanced for the heavier kernels (q3m_c/q3m_finish, corresponding to Coulomb QM/MM), whereas the lighter kernels were scheduled to the first Graphical Processing Unit (GPU). The considered benchmark is bench-A, using mixed fp64-fp32 precision.
6.5 Convergence pattern of the implemented load balancing algorithm (balancing every 2000 steps), for bench-C running on the GTX 780Ti/660Ti platform (mcx2). The presented PMC Cycle time measurements represent mean times since the previous balancing.
6.6 Scalability of the PMC Cycle when changing the size of the QM part in bench-A. Speed-up results are presented for a dual GTX 680 system with respect to a single GTX 680 (platform mcx4).
6.7 QM/MM simulation box for the bench-R dataset (partial representation), together with the simulation results for the conversion of the chorismate structure into prephenate.
LIST OF TABLES
3.1 QM/MM run characterization, together with the typical parameter range for the benchmarks considered in this work. For the case of homogeneous solvents, the Z(i)MM parameter (concerning molecule i) will be the same for every MM molecule.
5.1 Complexity of communication and synchronization overheads, with respect to the QM/MM system characteristics and to run parameters.
6.1 Considered QM/MM benchmark datasets. The chemical aspects of bench-R are presented in detail in [11].
6.2 Considered execution platforms in the experimental evaluation.
6.3 Execution time (in seconds) for a PMC Cycle with 10k steps, on several hardware platforms, when using fp64-fp32 mixed precision. The column "Total" corresponds to the complete execution times of the PMC Cycle (10k steps), including the final serial overhead of reading back and writing the output to a file. This overhead is discriminated in column "Output". The presented execution times correspond to a median among four experimental trials, for each platform configuration.
6.4 Kernel execution times obtained on the GTX 780Ti accelerator and on the reference avx2-baseline platform, for the particular case of bench-A. The speed-up with respect to the avx2-baseline is also presented, together with the fraction of the PMC Cycle (%) each kernel represents.
6.5 bench-R execution time for the PMC Cycle (50k steps) and QM Update (24.8M iters) stages, as well as for the full PMC simulation. The presented results consider two baselines and four parallel solutions, with either a single or 8 Markov chains and fp64 or fp64-fp32 precision.
6.6 Performance speed-ups for bench-R, considering the execution times presented in Table 6.5.
6.7 Speed-up of the mixed precision q3m_c kernel versions versus the original fp64 version, running on the same machine, for the case of bench-A.
6.8 Obtained numerical precision. The error is shown for the ∆E_QM/MM^C energy term, as well as for the total energy of the system (E), when considering the e_m = 1.0 × 10⁻¹ kJ/mol maximum error. The average values were taken from the complete set of generated QM/MM systems, by using bench-A.
6.9 Execution time speed-up, energy savings and average power consumption, when comparing the Tesla K20C GPU running all the devised numerical precision approaches with the avx2-baseline (with the original fp64 precision). The testbench was run on the K20C GPU for 100k steps, in order to ensure a representative sampling of the computational cost of q3m_c. The default core frequency configuration was used for all experiments.
LIST OF ACRONYMS
MD Molecular Dynamics
MC Monte Carlo
PMC Perturbative Monte Carlo
DMC Diffusion Monte Carlo
VMC Variational Monte Carlo
AFMM Adaptive Fast Multipole Method
MM Molecular Mechanics
QM Quantum Mechanics
QMC Quantum Monte Carlo
vdW Van der Waals
CPU Central Processing Unit
GPU Graphical Processing Unit
DSP Digital Signal Processor
FPGA Field Programmable Gate Array
ILP Instruction Level Parallelism
SIMD Single Instruction Multiple Data
MIMD Multiple Instruction Multiple Data
SPMD Single Program Multiple Data
PE Processing Element
CU Computing Unit
GPC Graphics Processing Cluster
PC Program Counter
LDS Local Data Share
SI Southern Islands
SMX Streaming Multiprocessor
CC Compute Capability
HPC High Performance Computing
CHAPTER 1
INTRODUCTION
Contents
1.1 Objectives
1.2 Main Contribution
1.3 Document Outline
Computer simulations have become standard tools in chemical research, allowing for the prediction
of complex molecular structures, together with a comprehensive characterization of their properties.
Using methods from theoretical chemistry, where mathematics and physics are used to study chemical processes, computational chemistry studies the properties of a chemical system, describing intermolecular interactions, geometrical arrangements and other chemically related problems. A particular case of molecular computer simulation is drug docking simulation, a method that predicts the preferred configurations of two molecules when binding to each other and plays a crucial role in the lengthy process of computer-aided drug design [18], to which thousands of lives are tied.
In molecular computer simulations, execution time and memory scale rapidly with the size of the
system being simulated, leading to prolonged runs (sometimes in the order of weeks or months) and
resulting in wasted time and energy. Advances in this field are not only due to recent developments of
physical models, but also to the advances in computing systems, which substantially reduce computa-
tional time. Mature implementations of computational science software are usually highly optimized for
traditional single core Central Processing Unit (CPU) architectures, and therefore are intrinsically limited
by advances in single core execution time. To tackle this limitation, more recent High Performance Com-
puting (HPC) solutions have been exploiting the advances in parallel and heterogeneous computing,
using parallel platforms such as multi-core CPUs and many-core Graphical Processing Units (GPUs)
and specialized accelerators such as Field Programmable Gate Arrays (FPGAs) and Digital Signal Pro-
cessors (DSPs).
Molecular simulations are commonly based on Molecular Dynamics (MD) or on the Monte Carlo (MC)
method. MD simulates the system by calculating the forces acting on each atom, applying classical
mechanics to compute the resulting velocities, which are subsequently used to evolve the system in time.
MD allows the study of a wide range of dynamical properties, such as the conformational landscape
of a molecule. However, usable results can only be obtained by using very small simulation steps
(in the order of the femtosecond), which limits the system simulation to the order of microseconds.
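The force-to-velocity-to-position update described above is typically carried out with a time-stepping integrator such as velocity Verlet. The sketch below is a generic illustration of that scheme for a one-dimensional particle (it is not the integrator of any code discussed in this thesis; the harmonic force and the step count are arbitrary choices for the example):

```python
import math

def velocity_verlet(x, v, force, dt, mass=1.0, steps=1000):
    """Integrate Newton's equations of motion with the velocity-Verlet
    scheme: positions and velocities are advanced in small steps dt
    (femtosecond-scale in real MD runs)."""
    a = force(x) / mass
    for _ in range(steps):
        x += v * dt + 0.5 * a * dt * dt   # advance position
        a_new = force(x) / mass           # force at the new position
        v += 0.5 * (a + a_new) * dt       # advance velocity
        a = a_new
    return x, v

# Harmonic oscillator F(x) = -x has period 2*pi, so integrating over
# exactly one period should bring the particle (approximately) back
# to its starting state.
x1, v1 = velocity_verlet(1.0, 0.0, lambda x: -x,
                         dt=2 * math.pi / 1000, steps=1000)
```

The example also hints at why MD time steps must be so small: the step dt has to resolve the fastest oscillation in the system, which is what limits total simulated time to the microsecond scale.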
Conversely, the Metropolis MC method [32] samples the system in the ensemble space, rather than
following a time coordinate. With this method, a sequence of random configurations is obtained on
the basis of Maxwell-Boltzmann statistics, by performing random movements at each frame and by
evaluating the corresponding change of the system energy. The resulting set is then analysed from
the perspective of the specific thermodynamic property under consideration. Even though MC does not
enable the computation of dynamical quantities, it allows studying processes with longer timescales, for
which sampling in time would be unfeasible.
Accordingly, the underlying method for calculating the energy of a given molecular structure can vary with the system and the properties under study. The choice may fall on traditional Molecular Mechanics (MM), Quantum Mechanics (QM) or mixed QM/MM methods. MM approaches represent atoms and molecules through ball-and-spring models, with heavily parameterized functions to describe their interactions. However, such an approach may lead to several limitations. For example, atomic bonds have to be kept throughout each simulation, thus preventing a chemical reaction from being modeled in a single run. Alternatively, QM approaches explicitly simulate the electrons, at the cost of a much higher computational burden, as they involve obtaining approximate solutions to the Schrödinger equation [21]. Furthermore, the computational cost of most QM methods scales exponentially with the system size, thus impeding the modelling of more complex structures. An alternative solution consists of a mixed QM/MM approach, which combines the strengths of each method. In this case, a small active region is simulated with QM, while the remaining environment is represented by classical MM. Nevertheless, the combination of the mixed QM/MM terms with the pure QM and MM terms that co-exist in this approach usually results in a very computationally diverse algorithm, containing both heavy single-threaded code and several opportunities to exploit task and data parallelism.
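The three families of terms mentioned above can be made explicit by writing down the additive coupling scheme commonly used in hybrid methods (this is the generic textbook form, not necessarily the exact expression used in the PMC method):

```latex
E_{\text{total}}
  = \underbrace{E_{\text{QM}}}_{\text{active region}}
  + \underbrace{E_{\text{MM}}}_{\text{environment}}
  + \underbrace{E_{\text{QM/MM}}}_{\text{coupling}}
```

The coupling term E_QM/MM, which gathers the electrostatic (Coulomb) and van der Waals interactions between the two regions, is precisely where the mixed terms arise, and each of the three terms exposes a different computational profile.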
Ongoing collaboration between INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, and the Institut für Physikalische Chemie, Georg-August-Universität Göttingen, led to this work, with the objective of accelerating their novel algorithm for Perturbative Monte Carlo (PMC) mixed QM/MM simulation of periodic systems. Generally, the purpose of this model is to explicitly describe a system composed of a solvent and a solute, by using Metropolis MC sampling and a mixed QM/MM method for the energy calculation [14]. An important application of such a strategy is the simulation of the docking of drugs in the active site of a protein using QM, taking the surrounding environment into account explicitly using MM. However, the original serial implementation of the PMC QM/MM suffers from extremely long execution times, thus severely limiting theoretical chemistry research on large (and realistic) QM/MM systems.
1.1 Objectives
The objective of this MSc thesis is to accelerate the PMC QM/MM algorithm by designing an efficient and scalable parallel implementation for heterogeneous architectures comprised of multi-core CPUs and GPUs. Furthermore, performance will be evaluated in several system configurations, by studying molecular simulations relevant to the Theoretical Chemistry field of application. The speed-up obtained with respect to the original serial version will be the main metric of interest, although the consumed energy and the resulting numerical precision will also be discussed. The fundamental objectives of this work are the following:
(i) Devise a parallel approach for the Perturbative Monte Carlo QM/MM simulation method.
(ii) Enable the efficient exploitation of heterogeneous hardware platforms.
(iii) Ensure a good scalability of the devised solution with the available computational resources.
(iv) Assess the performance of the developed solution.
1.2 Main Contribution
By addressing the objectives presented in Section 1.1, the main contributions of this work are the following:
(i) First parallel heterogeneous solution for the Perturbative Monte Carlo QM/MM method. The devised approach uses the OpenCL framework to parallelize the bottleneck procedures of the PMC algorithm, enabling computational chemistry researchers to use a wider variety of platforms (in comparison to using CUDA or other vendor-specific frameworks).
(ii) Acceleration procedure based on the simultaneous exploitation of fine-grained (at the data level), coarse-grained (at the Markov chain level) and task-level (pure QM, pure MM and QM/MM procedures) parallelism, to achieve a heterogeneous solution for platforms composed of multi-core CPUs and GPUs. Furthermore, a performance-aware dynamic load balancing algorithm was employed to fully exploit the computing power of all the devices in a given heterogeneous platform.
(iii) Parallel method for sampling the MC state-space, using a multiple Markov chain exploration scheme to effectively exploit coarse-grained parallelism in the available CPU cores. This solution proved to scale with an efficiency of about 85%. Furthermore, each group of CPU cores can share one or more GPU accelerators, which run the simulation bottleneck (PMC Cycle) with a speed-up ranging from about 13× to 152×.
(iv) Evaluation of energy saving and acceleration opportunities based on the adaptation of the numerical precision used by the algorithm, when considering double- or single-precision floating-point or fixed-point representations. This study was integrated in the analysis of performance, numerical quality and power. The devised mixed precision solutions offer up to 2.7× speed-up and save up to 2.8× energy in the bottleneck kernels, when compared to the double-precision version. With respect to the baseline PMC implementation, energy savings reach up to 28.8×.
(v) Assessment of the quality of the devised solution with several benchmarks relevant to the Theoretical Chemistry field of application. The designed parallel approach was tested on several different system configurations composed of Nvidia GPUs, AMD GPUs and Intel CPUs, producing the same chemical results as the original serial implementation, with numerical differences far below the maximum acceptable error. For the longest QM/MM simulation herein discussed, the parallel solution effectively reduced the full execution time of the PMC from ∼80 hours to ∼2 hours.
The cumulative contributions of this thesis to the scientific community have resulted in two research
articles. The first has already been submitted for publication in an international peer-reviewed journal,
whereas the second is awaiting submission:
• Sebastiao Miranda, Jonas Feldt, Frederico Pratas, Ricardo Mata, Nuno Roma, and Pedro Tomas, "Efficient Parallelization of Perturbative Monte Carlo QM/MM Simulations in Heterogeneous Platforms", International Journal of High Performance Computing Applications (submitted).
• Jonas Feldt, Sebastiao Miranda, Joao C. A. Oliveira, Frederico Pratas, Nuno Roma, Pedro Tomas, and Ricardo A. Mata, "Perturbative Monte Carlo mixed Quantum Mechanics/Molecular Mechanics", Journal of Chemical Information and Modeling (to be submitted).
In addition, the resulting application is now being actively used by the Free Floater Research Group -
Computational Chemistry and Biochemistry, Institut fur Physikalische Chemie, Georg-August-Universitat
Gottingen, for further scientific studies. The resulting parallel program package will be released under
the BSD-3-clause open source licence.
1.3 Document Outline
In Chapter 2, an overview of the current state-of-the-art CPU and GPU hardware is presented, as
well as a review of the literature on load balancing algorithms and a description of the OpenCL frame-
work. Chapter 3 presents a detailed description of the PMC QM/MM algorithm from a computational
point of view, and includes a discussion on the related work on accelerating computational chemistry
algorithms. In Chapter 4, a multi-device heterogeneous solution is introduced, and the strategy for ex-
ploiting multiple GPUs and CPU cores to execute multiple Markov chains is presented. In Chapter 5, the
developed OpenCL approach to the simulation bottleneck (PMC Cycle) is discussed in detail, as well as
a dynamic load balancing solution. In Chapter 6, the performance of the developed solution is evaluated
with a set of chemical benchmarks over a wide range of hardware configurations. Furthermore, an analysis of the scalability is performed for both single and multiple Markov chain solutions, and the impact of the numerical representation on the execution time, numerical quality and energy consumption is analyzed.
Finally, in Chapter 7, the conclusions of the presented work are drawn and the future work is discussed.
CHAPTER 2
HETEROGENEOUS COMPUTING ARCHITECTURES
Contents
2.1 Multi-Core General-Purpose Processors (GPP) Architecture
2.2 Graphical Processing Unit (GPU)
2.3 OpenCL
2.4 Load Balancing Techniques
2.5 Summary
Due to differing computational requirements, CPU and GPU architectures have been evolving in different directions, each with its own advantages and disadvantages. Notwithstanding, heterogeneous systems composed of both CPUs and GPUs can take advantage of both devices to accelerate the execution of a program. In particular, the GPU architecture provides many cores, each substantially simpler than a CPU core, trading off single-thread for multi-thread performance. The GPU achieves high throughput by hiding the memory access latency of one thread with arithmetic operations from other threads, and by rapidly switching execution context between groups of threads. Such context switching has very little cost in comparison to CPU threads because hundreds of thread contexts are stored on-chip1. Conversely, state-of-the-art multi-core CPU architectures offer a few but highly complex cores, using techniques to exploit Instruction Level Parallelism (ILP) and multiple levels of caches to accelerate main memory access. This higher complexity results in increased area and power consumption, which allows only a small set of cores to co-exist in a single die.
When targeting a CPU/GPU heterogeneous environment, the application must be carefully analyzed and partitioned to efficiently take advantage of both devices. Code with intensive flow control or limited data and functional parallelism should be kept on the CPU, whereas arithmetic-intensive, highly data-parallel code should be executed on the GPU. Furthermore, the applied partitioning should in general minimize communication between the CPU and the GPU. To design an efficient workload partitioning, a load balancing solution may be devised. The OpenCL framework does not offer any intrinsic tools for scheduling the workload between compute devices, and as such one must implement a balancing approach fit for the application at hand. In this respect, several authors have studied workload scheduling to complement standard heterogeneous frameworks, such as libraries for the CUDA framework [10], [1], [7] and the Maestro library for the OpenCL framework [47]. Other load balancing alternatives [9, 10, 12, 28, 42, 53] are discussed in Section 2.4.
In this chapter, an overview of both state-of-the-art CPU and GPU architectures is presented, followed by an introduction to the OpenCL programming framework and a review of the literature on load balancing solutions for heterogeneous platforms.
2.1 Multi-Core General-Purpose Processors (GPP) Architecture
State-of-the-art mainstream multi-core CPU architectures offer a few but highly complex CPU cores.
Very fast memory is available through the use of registers local to each core and access to the larger
but slower main memory is made via several levels of caches. In Figure 2.1, a typical example of a
multi-core CPU architecture with three layers of cache (2 private and 1 shared) is displayed. Several
hardware techniques are employed to accelerate single-threaded execution, such as increasing the clock
frequency through multi-stage hardware pipelining, resulting in higher instruction throughput. Furthermore, modern architectures exploit ILP using super-scalar and out-of-order instruction execution. While the former allows executing several instructions in parallel in the available functional units (provided there are no data, control or structural hazards), the latter enables reordering independent instructions to reduce
1For the case of the NVIDIA GK104/GK110 architectures, the maximum number of resident threads per Multi-processor is 2048 [39].
processor stalls. Program flow control overhead is mitigated by branch prediction hardware and ulti-
mately by allowing speculative execution. Latency caused by inevitable processor stalls may be further
hidden by hardware multi-threading, allowing simultaneous execution of different threads in the same
processor.
Figure 2.1: Example CPU with 4 cores and 3 levels of cache (private L1 and L2 caches per core, a shared L3 cache, and a common memory interface).
Higher hardware complexity results in increased area and power consumption per core, which is the reason why multi-core CPUs only include a small number of cores relative to GPUs. This means that the type of parallelism that can be extracted from a multi-core CPU architecture is also more coarse-grained, typically leading to the application of the Multiple Instruction Multiple Data (MIMD) parallel programming paradigm. Furthermore, communication between threads in different CPU cores is much more expensive than communication between threads of the same thread-block in a GPU. In modern CPUs, each core offers Single Instruction Multiple Data (SIMD) instructions that enable the extraction of fine-grained parallelism (e.g. Intel SSE/AVX instructions), making the multi-core CPU a very versatile parallel platform. Although it does not match the GPU in terms of floating-point operations per second for highly data-parallel applications, it is more efficient for algorithms with complex control flow or a very coarse-grained parallel structure.
2.2 Graphical Processing Unit (GPU)
Graphical Processing Units were originally designed to accelerate graphics computations. However, given the complexity of designing highly efficient dedicated architectures with support for a large number of operations, significant design changes have been made over time, and GPU vendors started to introduce programmable vertex and pixel shaders. Over the past years, programmability support has increased substantially, allowing for General Purpose computing on Graphics Processing Units (GPGPU). Meanwhile, to facilitate programmability, both Nvidia and AMD released proprietary GPU programming languages, respectively CUDA and CTM (although AMD has lately embraced the OpenCL open standard, which is also supported by Nvidia GPUs, Intel CPUs and embedded GPUs, and a multitude of other devices). To better understand the architectural differences between CPUs and GPUs, this section presents an overview of AMD's Southern Islands and Nvidia's Kepler architectures.
2.2.1 AMD and Nvidia Architectures
Figure 2.2 depicts the AMD Southern Islands (SI) GPU architecture (HD7000 family). This architec-
ture is composed of several Computing Units (CUs). Each CU has one scalar unit and 4 vector units
composed of an array of 16 processing elements (PEs) each. Local to each CU, there are also five banks
of vector and scalar General Purpose Registers (vGPR/sGPR) and Local Data Share (LDS) memory.
The instruction issue takes four cycles, during which the four 16-Processing Element (PE) arrays execute 64 work-items in total. The resulting 64-element vector is called a wavefront2. Processing elements within a compute unit execute in lock-step, whereas compute units execute independently with respect to each other. Lock-step execution may pose problems if work-items from the same wavefront fall on different branch paths, in which case all paths must be executed serially, thus reducing efficiency. This is because work-items from the same wavefront share the same Program Counter (PC). In this device architecture family, the four 16-PE arrays execute code from different wavefronts.
Figure 2.2: Organization of the AMD Southern Islands GPU architecture.
The current state-of-the-art NVIDIA device architecture is Kepler (Compute Capability (CC) 3.X). An example device from this family is the GeForce GTX 680 GPU, which includes the Kepler GK104 chip. This particular GPU is composed of 4 Graphics Processing Clusters (GPCs), each with 2 Streaming Multiprocessors (SMX), and 4 memory controllers. Each SMX is in turn composed of 192 CUDA cores (roughly equivalent to the AMD ALUs presented earlier). Each SMX contains 4 warp schedulers that dispatch two instructions per warp with active threads, every clock cycle. A warp is the group of threads representing the finest grain of instruction execution of a multiprocessor. For the Tesla, Fermi and Kepler device architectures, this number is 32, meaning that at least 32 threads must execute the same instruction. A warp is roughly the equivalent of a wavefront in AMD hardware.
A thread is said to be active if it is on the warp's current execution path; otherwise it is inactive. When threads in the same warp follow different execution paths, the warp is said to be diverging (the same happens for wavefronts in AMD hardware). For example, in the case of a kernel with
2Other AMD graphic card families may have different wavefront sizes.
Figure 2.3: GTX 680 (GK104) device architecture: 4 GPCs with 2 SMXs each, 4 memory controllers, a shared L2 cache and a GigaThread engine; each SMX contains 192 CUDA cores (INT/FP units), 4 warp schedulers with 8 dispatch units, SFU and LD/ST units, a 64 kB L1 cache / shared memory, a 65536 x 32-bit register file, and texture and uniform caches.
a branch where some threads fall on different branch sections, all sections must be executed serially, reducing parallel efficiency. To mitigate this problem, the programmer should try to align branch outcomes with warp/wavefront boundaries, ensuring that threads from the same warp/wavefront always fetch the same instructions. This may not be possible at all if the branching is not predictable, or it may change the algorithm's memory access patterns in such a way that the losses outweigh the gains obtained by reducing divergence. In older GPU devices (i.e. Nvidia's CC 1.X), optimal memory access patterns were restricted to a fairly reduced set, but in CC 3.X the access patterns that still offer optimal performance are much more relaxed [40].
2.3 OpenCL
As previously mentioned, several paradigms exist for programming both CPUs and GPUs. However,
unlike most alternatives, OpenCL is supported by several different platforms, such as GPUs from multiple vendors, multi-core CPUs, DSPs and FPGAs [46]. Also, OpenCL simplifies the orchestration of multiple devices in a heterogeneous environment and allows writing code that is portable across different architectures. Thus, OpenCL was chosen for performing the proposed work.
OpenCL is organized in a hierarchy of models [22]: Platform Model, Execution Model, Memory
Model and Programming Model. Each of these models is explained in the following sections. The
OpenCL framework includes the OpenCL compiler (OpenCL C), the OpenCL platform layer and OpenCL
Runtime. In this project, the newest available OpenCL standard was used for each device (OpenCL 1.1
for the considered Nvidia GPUs and OpenCL 1.2 for the Intel CPUs and AMD GPUs).
2.3.1 Platform Model
The Platform Model defines how a program maps onto the OpenCL platform, which is an abstract
hardware representation of the underlying device. As depicted in Figure 2.4, the platform model is
composed of a Host connected to one or multiple OpenCL devices. An OpenCL device is a collection of CUs, which are in turn divided into one or more PEs3, where the computation is done. The code that runs on the host uses the OpenCL Runtime to interface with the OpenCL device, to which it may enqueue synchronization commands, data or kernels. A kernel is a function written in OpenCL C, and can be compiled before or during program execution. Within each CU, PEs can execute either in SIMD or Single Program Multiple Data (SPMD) fashion. In the former, PEs execute in lock-step, whereas in the latter PEs keep their own program counter and may follow independent execution paths.
Figure 2.4: OpenCL Platform Model [22].
2.3.2 Execution Model
An OpenCL program executes over an index space and comprises two main components: host code running on the host device and kernel code running on each OpenCL device. Kernel instances are called work-items and are further grouped into work-groups. Each work-item has a unique identifier in the global index space and in the local index space (local to each work-group). Index spaces are called NDRanges and can have 1, 2 or 3 dimensions; thus, the local and global indices are 1-, 2- or 3-dimensional vectors. Figure 2.5 depicts an example of this organization for 2 dimensions. For GPU devices, the best performance should be attained when the work-group size is an integer multiple of the warp size (NVIDIA) or the wavefront size (AMD), because this is the minimum execution granularity supported. Failing to meet this criterion leaves idle hardware threads in the last warp or wavefront of each work-group.
To support different devices with different thread management systems, OpenCL employs a relaxed
synchronization and memory consistency model. This way, execution of work-items is not guaranteed
to follow any specific order. Nevertheless, explicit work-group barrier instructions can be placed in the
3This structure (and naming) closely resembles the one for AMD devices, presented in section 2.2.1.
Figure 2.5: Partitioning of work-items into work-groups.
kernel code to ensure execution synchronization between work-items of the same work-group. Synchronization of work-items belonging to different work-groups is not possible within the same kernel launch, a behaviour depicted in Figure 2.6. Memory consistency details are explored in Section 2.3.3.
Figure 2.6: Work-item synchronization: a work-group barrier synchronizes the work-items of one work-group (with memory consistency guaranteed inside each work-group), whereas global synchronization and memory consistency across work-groups are only guaranteed between kernel launches.
Another important concept is the OpenCL Context, which includes a collection of OpenCL Devices, a set of kernels, a set of Programs (source and compiled binaries that implement the kernels) and a set of Memory Objects. Associated with a Context are one or more Command Queues, via which the host enqueues execution, memory and synchronization commands to the OpenCL Devices. Each queue may be set as in-order or out-of-order, which determines whether commands must execute in the order in which they were enqueued.
2.3.3 Memory Model
The OpenCL standard defines four memory region types, each having different rules for access and
allocation:
(i) Global Memory: This memory region is accessible by all work-items for read/write operations. Furthermore, the OpenCL Host has read/write access and is responsible for dynamic memory allocation. This memory may or may not be cached, depending on the target architecture. AMD SI GPUs and newer NVIDIA devices, for example, have global memory caches accessible by each CU. Global memory read/write consistency between work-items of the same work-group is only guaranteed after they encounter a global work-group barrier. Conversely, there is no guarantee of memory consistency across different work-groups during the execution of a kernel. This behavior is depicted in Figure 2.6.
(ii) Constant Memory: Memory accessible by all work-items for read operations, remaining constant during the execution of a kernel. Like Global Memory, the Host has read/write access and is responsible for (dynamic) memory allocation. Constant memory is usually cacheable (e.g., in the Kepler architecture it is implemented as a configurable fraction of the L1 cache) and typically has a lower average access latency than Global Memory.
(iii) Local Memory: This memory region is shared by work-items of the same work-group for read/write
operations. Allocation can be done either statically by a kernel or dynamically by the Host (although
the Host cannot access this memory region). It is usually implemented as dedicated memory in
each CU, but in some devices it can also be mapped into Global Memory. In AMD SI GPUs, this memory is mapped into the LDS (see Figure 2.2), whereas in Nvidia's Kepler architecture it is mapped
into the Shared Memory (see Figure 2.3). Local memory is only consistent between work-items of
the same work-group after they encounter a local work-group barrier, as depicted in Figure 2.6.
(iv) Private Memory: Memory region private to each work-item, for read/write access. Neither the Host
nor other work-items can access this memory. It must be statically allocated in the kernel and is
usually implemented as registers in each CU.
2.3.4 Programming Model
The OpenCL standard supports two programming models: Data Parallel and Task Parallel. In the Data Parallel programming model, parallelism is exploited by executing the same set of operations in parallel over a large collection of data. Considering computation over data in an array, each work-item executes an instance of the kernel on one array index (strictly data-parallel model) or on several (relaxed data-parallel model). The hierarchical partitioning of work-items into work-groups can be defined explicitly by the programmer or implicitly by the OpenCL implementation.
Conversely, in the Task Parallel programming model, a single instance of the kernel is executed,
where parallelism can be extracted by using vector types supported by the device or by enqueueing
multiple tasks (different kernels) to the Device. Intel SSE/AVX/AVX2 vector instructions, for example,
can be inferred by writing operations with OpenCL vector types (e.g. float4, int4).
2.3.5 OpenCL Runtime Parametrization
To account for the existing heterogeneity, the OpenCL Host can query the underlying platform through
the OpenCL library for the available devices and their specific characteristics. As an example, the pre-
ferred elementary work-group size of each device can be queried, typically returning the warp-size (32)
for Nvidia GPUs, and the wavefront size (64) for AMD GPUs. For Intel OpenCL compatible CPUs, this
number is usually equal to (or higher than) 64 [24]. According to the results obtained from this device discovery process, different work-group partitioning schemes may be used for each device (e.g., number of work-items, work-group size, amount of data per work-item, etc.). Furthermore, to enable inter-platform portability, the OpenCL framework offers the possibility of compiling the developed kernels at runtime, allowing different compilation flags or kernel versions to be chosen according to the target platform.
2.4 Load Balancing Techniques
When considering the trade-offs between the multi-core CPU and GPU architectures, it makes sense to attempt a simultaneous exploitation of these computational platforms, by scheduling the workload to the device best suited for each particular task. Figure 2.7 depicts an example network of heterogeneous computing nodes, each comprised of multiple CPU cores and one or more specialized accelerators. In particular, these accelerators can be GPUs with very different compute capabilities, or even other types of hardware platforms (e.g. FPGAs, DSPs). In such a heterogeneous environment, HPC applications
Figure 2.7: Example of a heterogeneous network composed of several compute nodes, each comprised of multiple CPU cores and one or more specialized accelerators (e.g. GPUs, FPGAs).
frequently call for load balancing mechanisms to distribute the workload among the available processing nodes. A simple and insightful way of posing a typical load balancing problem is the following: consider a cluster of p processing nodes, and let t_i(d_i^k) be the time taken by node i to compute over the assigned data d_i^k at iteration k, where i ∈ {0, ..., p − 1}. The objective is that at some iteration k = b, all devices take the same time to compute the assigned load, i.e., t_i(d_i^b) = t_j(d_j^b) for every pair of nodes {i, j}. Specialized algorithms may take into account other performance metrics, such as consumed power [29] or inter-node communication latency [28]. Furthermore, while some publications aim to present generic load balancing methods, others focus on offering a solution for specific applications or scientific fields.
Load balancing algorithms found in the literature can typically be classified according to some fundamental characteristics [12]. First of all, the load balancing solution can be either Static or Dynamic. Static [28] implementations evaluate the characteristics of the application and the target hardware platform (either at compile-time or run-time) and derive the workload distribution from these data. For example, in [28] the authors introduce an algorithm to find a subset of computing nodes in a complex network that form an optimal virtual ring network, classifying candidate nodes by considering the processing capabilities of each one and the bandwidth of the respective communication links. Conversely, dynamic [10, 42] load balancing solutions take into account one or more performance metrics (e.g. time, power, accuracy) measured at run-time and dynamically modify the workload distribution to best fit the heterogeneous platform. For example, in [42], a dynamic load balancing algorithm is devised for the Adaptive Fast Multipole Method (AFMM), a solver for n-body problems (e.g. colliding galaxies, fluid dynamics). In order to balance the load in a cluster composed of 10 CPUs and 4 GPUs, an adaptive decomposition of the particle space is employed and modified dynamically according to a performance model that predicts the performance of future iterations using previous execution time measurements.
Secondly, load balancing algorithms can be either Centralized [9, 10, 12] or Decentralized [28, 53]. The former concentrate load balancing decisions in one monitoring node that schedules the work among the cluster, whereas the latter rely on local decisions made on each computing node (possibly using information from neighbour nodes) to distribute the workload among them. Furthermore, centralized load balancing algorithms can be further classified as either Task-Queue [10] or Predicting-The-Future [9, 12, 42]. Task-queue algorithms rely on partitioning the workload into several smaller tasks, which are continually fetched by the computing nodes. Although they are a relatively simple solution to implement, a high-speed communication link is required between the node managing the task-queue and every other computing node, since tasks usually have to be fetched frequently (to ensure a fine-grained balancing). Conversely, predicting-the-future approaches schedule the work based on performance measurements from past iterations. If the balancing solution is well implemented (and the target algorithm allows it), these approaches can converge to a stabilized workload distribution and cease to require intensive inter-node communication.
Considering the importance of load balancing methods for scheduling the workload among heterogeneous devices, two balancing solutions were employed in the parallelization approach devised in this dissertation. The first is a task-queue algorithm with a distributed balancing decision, whereas the second is a centralized predicting-the-future dynamic load balancing approach. Details about these two algorithms will be presented in Chapter 4 and Chapter 5.
2.5 Summary
In this chapter, an overview of both state-of-the-art CPU and GPU hardware was presented, and the architectural differences between the two platform families were discussed. Next, an overview of the OpenCL programming framework was introduced, highlighting the structure of the framework and the opportunities it offers to exploit a wide range of accelerators. The advantages of exploiting heterogeneous platforms comprised of CPU and GPU devices, together with the wide availability of these computational resources among scientific research groups, led to the choice of targeting this type of system. In this respect, a brief review of the literature on load balancing solutions was presented, covering typical approaches for efficiently scheduling the workload in a heterogeneous computing environment. Further details about the particular load balancing algorithms employed in this dissertation will be presented in Chapter 4 and Chapter 5.
CHAPTER 3
PERTURBATIVE MONTE CARLO QM/MM
Contents
3.1 Algorithm Description
3.2 Computational Complexity Analysis
3.3 Data Dependencies
3.4 Related Work
3.5 Summary
The Perturbative Monte Carlo QM/MM algorithm is a molecular simulation procedure designed for mixed QM/MM simulations. These simulations usually consider a circumscribed region of interest (often referred to as the active site) and an immersive environment. The algorithm takes a QM/MM system as input and outputs other chemically viable configurations of the same system, sampled with the Metropolis Monte Carlo rule [32]. As introduced earlier in this dissertation, the Metropolis MC method samples the system in the ensemble space, rather than following a time coordinate. With this method, a sequence of random configurations is obtained on the basis of Maxwell-Boltzmann statistics, by performing random movements at each step and by evaluating the corresponding change of the system energy. The underlying methods for calculating this energy can be traditional MM, QM or mixed QM/MM methods. MM approaches represent atoms and molecules through ball-and-spring models, with heavily parameterized functions to describe their interactions. Alternatively, QM approaches explicitly simulate the electrons, at the cost of a much higher computational burden. An alternative solution (which is employed in the algorithm herein studied) consists of a mixed QM/MM approach, which combines the strengths of each method. In this case, a small active region is simulated with QM, while the remaining environment is represented by classical MM. Achieving a comprehensive understanding of the algorithm structure represents a fundamental step to devise the best parallelization approach. Accordingly, a brief characterization of the QM/MM simulations under study, together with an overview of the PMC method, is presented in this chapter. Next, a computational complexity analysis and the strategy that will be applied for the algorithm parallelization are introduced. Finally, the related work on accelerating molecular simulation algorithms is discussed.
3.1 Algorithm Description
For the purpose of describing the PMC method, a chemical solution composed of a solute (region of interest) and a solvent (environment) will herein be considered as an example. Accordingly, Figure 3.1 depicts a schematic of such a system, comprised of a single solute molecule (molecule C), which is treated at the QM level, and two solvent molecules (molecules A and B), treated at the MM level. By applying a Metropolis MC step, one of these molecules is randomly picked, translated and rotated to generate a new structure. This MC step is then either accepted or rejected, according to the resulting energy change. MC steps are accepted if the energy of the obtained configuration is lower than that of the reference configuration (the last accepted configuration), or accepted with probability1 e^(−∆E/(k_B T)) if the energy of the system has risen.
As depicted in Figure 3.1, the energy change is computed by considering two types of interactions with the moved molecule (e.g. molecule A), resulting in either QM/MM energy terms or pure MM terms. The QM/MM terms account for the interaction with the QM solute (molecule C), whereas the MM energy terms account for the interaction with every other solvent molecule (in this case, just molecule B). Furthermore, for both levels of theory (QM/MM or pure MM), Coulomb and van der Waals (vdW) contributions have to be considered.
1Boltzmann distribution, where k_B stands for the Boltzmann constant and T for the temperature.
Figure 3.1: A system composed of one QM molecule (C) and two MM solvent molecules (A and B). For each MC step, the difference in energy between the moved molecule (A) and every other molecule has to be computed, but at different levels of theory.
Figure 3.2: Perturbative Monte Carlo QM/MM with focus on the simulation bottleneck (PMC Cycle, right): each of the K_PMC iterations runs a QM Update and a PMC Cycle of K_Cycle Monte Carlo steps, each step computing the Coulomb and vdW energy terms at the MM and QM/MM levels, testing for acceptance, updating the reference and periodically outputting results. Arrows represent data dependencies.
Furthermore, Figure 3.2 illustrates the dataflow of the target PMC method. In each PMC Cycle
(right), KCycle Monte Carlo steps of the MM subsystem are executed, while keeping the QM region
static. In another process, the electronic density of the QM region is updated (QM Update) by using
MOLPRO[55], and the result is subsequently used in the next PMC Cycle. As described earlier, the
system energy variation has to be computed at each MC step (henceforth referred to as a PMC Cycle
step), given by the expression:
∆E = ∆E_MM^C + ∆E_MM^vdW + ∆E_QM/MM^vdW + ∆E_QM/MM^C,nuclei + ∆E_QM/MM^C,grid   (3.1)
where each partial ∆E term corresponds to an energy contribution computed in a particular PMC Cycle
procedure (see Figure 3.2). As previously introduced, each PMC Cycle step consists of selecting,
translating and rotating a random MM molecule, computing ∆E (see Equation 3.1), and checking the
current QM/MM system configuration for acceptance. In order to store the obtained results, the current
QM/MM configuration is written to an output file every F_output iterations (see Table 3.1). Moreover,
despite being a good example for illustrating the algorithm, Figure 3.1 only depicts a very small system.
In contrast, a more general QM/MM run will have a much higher number of molecules, and is characterized
Table 3.1: QM/MM run characterization, together with the typical parameter range for the benchmarks considered in this work. For the case of homogeneous solvents, the Z_MM^(i) parameter (concerning molecule i) will be the same for every MM molecule.

Input QM/MM System
  Parameter    Description                                                     Typical Range
  N_QM         Number of QM grid points                                        [10^5, 10^7]
  N_MM         Number of MM molecules                                          10^3
  Z_QM         Number of atoms in the QM region                                [10, 10^2]
  Z_MM^(i)     Number of atoms per MM molecule                                 [1, 10]
  A_MM         Number of MM atoms                                              [10^3, 10^4]

Run Parameters
  K_PMC        Number of PMC iterations                                        [10^4, 10^7]
  K_Cycle      Number of PMC cycle steps (per PMC iteration)                   [10, 10^3]
  F_output     Output write frequency (procedure Output Result, Figure 3.2)    [10^3, 10^4]
by the parameters introduced in Table 3.1.
The Coulomb QM/MM energy computation is of particular interest, since it is the most computationally
intensive calculation in each PMC Cycle. This energy contribution is accounted for by two distinct terms,
∆E_QM/MM^C,nuclei and ∆E_QM/MM^C,grid. The former accounts for the interaction with the atoms of the QM molecule,
represented by classical nuclei-centred charges, whereas the latter accounts for the interaction with the
QM electronic density, represented by a grid of point charges (henceforth referred to as grid). Between
the two, the ∆E_QM/MM^C,grid term is considerably more computationally intensive (see Section 3.2 for more details)
and corresponds to a discretization of the integral shown in Equation 3.2, where Z_MM and N_QM follow
the definitions given in Table 3.1, ρ(.) is the electronic density function, q the charge and r the distance
between the changed molecule and each grid point.
∆E_QM/MM^C,grid = Σ_j^(Z_MM) ∫ ρ(r) q_j / r_{i,j} dr   --grid-->   Σ_j^(Z_MM) Σ_i^(N_QM) q_i q_j / r_{i,j}   (3.2)
The pseudo-code for the Coulomb Grid QM/MM energy computation (∆E_QM/MM^C,grid) is presented in Algo-
rithm 1. As shown, the Coulomb potential is computed for each {atom, grid point} pair (considering the
atoms of the displaced molecule). Furthermore, since periodic QM/MM systems (defined by a repeat-
able simulation box) are herein considered, the spatial range of the considered electrostatic interactions
(i.e., Coulomb, vdW) has to be limited by a cutoff distance (r_c). Accordingly, shifted potentials
(V_shift) [16] are used in the ∆E_QM/MM^C,grid interaction terms
V_shift = { 1/r − 1/r_c + (1/r_c²)(r − r_c)   if r < r_c
          { 0                                 if r ≥ r_c      (3.3)
affecting each term differently, depending on the distance (r) between each {atom, grid point} pair,
and completely disregarding (setting to 0) the interaction whenever r ≥ r_c. The usage of shifted potentials
can be observed in Algorithm 1, resulting in four possible space regions, depending on the distance
between the considered grid point and both the old and the new set of coordinates of each atom of the
displaced molecule. Hence, four slightly different energy expressions (resulting from the application of
V_shift) may be computed. As discussed further in this dissertation, the procedure presented in Algorithm 1
will be one of the main targets of parallelization.
Algorithm 1 Coulomb Grid QM/MM energy (∆E_QM/MM^C,grid). See Table 3.1 for parameter definitions.

Define: atom := {position = {x, y, z}, chemical params = {σ, ε, q}}
Init: Energy = 0.0
Init: r_c → Coulomb cutoff (run parameter)
 1: for each atom i in changed molecule do            [ Z_MM^(chmol) cycles ]
 2:   for each point j in charge grid do              [ N_QM cycles ]
 3:     r_old = distance(i, j) in reference system
 4:     r_new = distance(i, j) in new system
 5:     qs = −q_i × q_j
 6:     if r_new < r_c and r_old < r_c then
 7:       Energy += qs × (1/r_new − 1/r_old + (1/r_c²)(r_new − r_old))
 8:     else if r_new < r_c and r_old ≥ r_c then
 9:       Energy += qs × (1/r_new − 1/r_c + (1/r_c²)(r_new − r_c))
10:     else if r_old < r_c then
11:       Energy −= qs × (1/r_old − 1/r_c + (1/r_c²)(r_old − r_c))
12:     end if
13:   end for
14: end for
3.2 Computational Complexity Analysis
The computational complexity of the PMC QM/MM method depends on the complexity of the program
procedures that comprise the PMC Cycle (see Figure 3.2). The Monte Carlo Step procedure, which
consists of rotating and translating a random molecule (henceforth referred to as chmol), has a
complexity proportional to the size of that MM molecule

O(Monte Carlo Step) = Z_MM^(chmol)   (3.4)
usually having a very light execution time footprint. On the other hand, the complexity of the Coulomb
Grid QM/MM procedure (Algorithm 1) is proportional to the product of the size of chmol and the number
of grid points

O(Coulomb Grid QM/MM) = Z_MM^(chmol) × N_QM   (3.5)
which will usually be the most time-consuming procedure, since N_QM is typically a large number. The
other two Coulomb computations have identical algorithm structures, although the involved data differs.
Coulomb Nuclei QM/MM uses nuclei-centred point charges instead of the electronic grid, and thus its
complexity is given by:

O(Coulomb Nuclei QM/MM) = Z_MM^(chmol) × Z_QM   (3.6)
yielding a much lower complexity in comparison with Coulomb Grid QM/MM. On the other hand, Coulomb
MM computes the interaction between each atom of chmol and each atom of every other MM molecule.
Hence, its complexity is given by:

O(Coulomb MM) = Z_MM^(chmol) × Σ_i^(N_MM) Z_MM^(i)   (3.7)
which simplifies to (Z_MM^(chmol))² × N_MM for the case of homogeneous solvents. The total
number of MM atoms (A_MM) may also be used in this text as the complexity variable, considering that
A_MM = Σ_i^(N_MM) Z_MM^(i), for either homogeneous or heterogeneous solvents.
The vdW energy calculations have a completely different energy expression, as presented in Algorithm 2,
which shows the pseudo-code for vdW MM. Likewise, the vdW QM/MM procedure shares an identical
structure, although it loops over the QM nuclei-centred charges instead of the MM atoms. Similarly
to the Coulomb computations, the vdW procedures have a nested for-loop structure with four cutoff
branches. Thus, the complexity of the vdW procedures is identical to that of their Coulomb counterparts,
yielding the following expressions:

O(vdW MM) = Z_MM^(chmol) × Σ_i^(N_MM) Z_MM^(i)   (3.8)

O(vdW QM/MM) = Z_MM^(chmol) × Z_QM   (3.9)
Finally, each PMC Cycle step terminates with an update of the current system reference and output
writing. The reference update complexity is proportional to the size of the changed molecule, whereas
the output saving is proportional to the total number of MM atoms over the writing frequency:
O(Update Reference) = Z_MM^(chmol)   (3.10)

O(Output XYZ) = (Σ_i^(N_MM) Z_MM^(i)) / F_output   (3.11)
Having in mind the typical magnitude of the QM/MM parameters (see Table 3.1), one can deduce
that the most computationally intensive procedure is the Coulomb Grid QM/MM. This will be taken into
consideration when parallelizing the PMC program procedures. By accounting for all the PMC Cycle
procedures, the complexity of one PMC Cycle step results in:

O(PMC Cycle) = Z_MM^(chmol) × (N_QM + Z_QM + A_MM) + A_MM / F_output   (3.12)
by recalling that A_MM = Σ_i^(N_MM) Z_MM^(i). Considering the typical ranges for these parameters (see Ta-
ble 3.1), one can observe that the leading term will be Z_MM^(chmol) × N_QM. In particular, the N_QM
parameter will have the heaviest footprint on the resulting complexity.
3.3 Data Dependencies
The PMC Cycle operates over three main data structures, which are depicted in Figure 3.3. Firstly,
the changed molecule (chmol), which is composed of Z_MM^(chmol) atoms, each represented by three-
dimensional Cartesian coordinates x, y, z and chemical constants σ, ε, q. Secondly, the QM grid, which is
composed of N_QM point charges, each also represented by Cartesian coordinates and a charge² (q).
Finally, the MM lattice, which comprises all MM molecules, including the chmol data before the MC step
takes place, and the QM molecule represented with classical MM nuclei (Z_QM atoms).
²In this case, the charge is not constant, because it is modified (alongside the coordinates) by the QM Update process.
Algorithm 2 vdW MM energy (∆E_MM^vdW). See Table 3.1 for parameter definitions.

Define: atom := {position = {x, y, z}, chemical params = {σ, ε, q}}
Init: Energy = 0.0
Init: r_c → van der Waals cutoff (run parameter)
 1: for each atom i in changed molecule do            [ Z_MM^(chmol) cycles ]
 2:   for each atom j in every other MM molecule do   [ Σ_j^(N_MM) Z_MM^(j) = A_MM cycles ]
 3:     r_old = distance(i, j) in reference system
 4:     r_new = distance(i, j) in new system
 5:     if r_new < r_c and r_old < r_c then
 6:       Energy += √(ε_i × ε_j) × ((σ_i σ_j / r_new²)⁶ − (σ_i σ_j / r_old²)⁶ − (σ_i σ_j / r_new²)³ + (σ_i σ_j / r_old²)³)
 7:     else if r_new < r_c and r_old ≥ r_c then
 8:       Energy += √(ε_i × ε_j) × ((σ_i σ_j / r_new²)⁶ − (σ_i σ_j / r_new²)³)
 9:     else if r_old < r_c then
10:       Energy −= √(ε_i × ε_j) × ((σ_i σ_j / r_old²)⁶ − (σ_i σ_j / r_old²)³)
11:     end if
12:   end for
13: end for
Figure 3.3: Main data structures used in the PMC Cycle: chmol (Z_MM^(chmol) × {x, y, z, σ, ε, q}), grid (N_QM × {x, y, z, q}) and lattice ((N_MM × Z_MM + Z_QM) × {x, y, z, σ, ε, q}). Refer to Table 3.1 for parameter definitions.
Figure 3.4 shows the data dependencies of each process in the PMC Cycle. In particular, the data
structure corresponding to the changed molecule (chmol) is written by the Monte Carlo Step and
subsequently read by all the energy calculation procedures, which compute their respective ∆E energy
terms to be processed by the Decide & Update procedure. Then, if the step under consideration is
accepted, the lattice corresponding to the MM Region (see Figure 3.1) is updated with the tested chmol
configuration, and a new Monte Carlo Step may take place. Unlike the other data structures, the grid
corresponding to the QM Region (see Figure 3.1) is not modified within the PMC Cycle. Instead, it is up-
dated by the QM Update process. Hence, considering the described data dependencies within the PMC
Cycle, it is observed that the energy contribution procedures can be executed in parallel with respect
to each other. Furthermore, each energy calculation is itself amenable to parallelism, as it can be
mapped to a parallel reduction structure. For the particular case of the Coulomb Grid QM/MM
procedure, this can be verified by inspecting Algorithm 1, although the other energy calculations share
the same structure, apart from the energy expression and the involved data (e.g., see Algorithm 2).
Having this in mind, the PMC Cycle is the main target of study and parallelization in this work. To
this end, several OpenCL kernels were devised to extract the available parallelism in the PMC Cycle
procedures, as well as a capable Host-side management framework to schedule the work among the
Figure 3.4: Data dependencies within the PMC Cycle. The vdW QM/MM and Coulomb Nuclei QM/MM processes only read the atoms that are part of the QM molecule, not the whole lattice.
available computational resources. The internal implementation of the QM Update will be kept mostly
unchanged3, apart from simple add-ons to accelerate inter-process communication. Nevertheless, a
scalable multiple Markov chain solution, which exploits parallelism in the MC state-space sampling, was
designed to accelerate the QM Update procedure. Chapters 4 and 5 discuss the devised solution in
detail.
3.4 Related Work
Due to the computational complexity of molecular simulation procedures, there has been substantial
research work on their acceleration. The literature describing this work can be grouped by: i) the nature
of the employed sampling, ii) the type of theory used for the energy calculations and iii) the chemical
application for which they have been tuned. The employed sampling is usually performed in time (MD)
or in state-space (MC) and the energy interactions may consider pure QM, pure MM and mixed QM/MM
terms. Furthermore, for the same state-space sampling strategy, several variants may be considered.
For the case of MC sampling, this includes (among other possible approaches) the Diffusion Monte Carlo
(DMC) [34], the Variational Monte Carlo (VMC) and the PMC [51]. Finally, the application for which the
algorithm has been tuned may vary greatly, and this is the main reason why the performance
gains attained in the parallelization of the algorithms in this field can seldom be compared to each other.
MD is a popular approach to studying a wide range of dynamical properties, and it has led to several
acceleration works, dating from the early days of GPGPU [17, 48] to more recent publications [31, 38, 44].
On the other hand, methods based on MC sampling allow simulating systems with longer timescales,
and several works have also accelerated these algorithms by following GPGPU approaches [2, 3, 13,
23, 30, 52]. Our work falls into the latter category (MC) and therefore we shall present a more detailed
review of those works.
The work in [2] presents a GPGPU solution for Quantum Monte Carlo (QMC), achieving up to 30×
speed-up in individual kernels and up to 6× speed-up in the overall execution. The QMC variety that is
considered by such research is based on DMC, unlike the PMC approach followed in our work. They
³The MOLPRO program suite is a closed-source commercial tool and performs extremely complex calculations in the QM Update procedure. Besides not having the main code available (aside from user scripts), it is not the bottleneck of the PMC QM/MM, and thus optimizing it is out of the scope of this work.
employ a scheme for simultaneous state-space exploration (each chain being called a walker), similar
to the multiple Markov chain approach that is herein adopted. However, they emphasize exploiting a
high amount of parallelism at the walker level (up to 16 simultaneous walker evaluations on the GPU),
whereas we focus on exploring the finer-grain level of parallelism within each chain (which, in our case,
is heavy enough to keep the GPU busy), and manage chain-level parallelism with fewer chains per GPU.
We took this approach since spawning a very large number of chains on the same GPU would be
unfeasible for the case of the PMC method, since each chain requires computing not only the MC
step trials (in this case, the PMC Cycle procedures), but also the intrinsically serial QM Update process.
In [13], the authors discuss a parallel GPGPU approach to continuum QMC, based on DMC. They target
Nvidia GPUs by using the CUDA framework, and MPI to schedule the work among computational
clusters, exploiting walker-level and data-level parallelism, and achieving full-application speed-ups
from 10× to 15× with respect to a quad-core Xeon CPU implementation. However, unlike the
work herein described, they do not target QM/MM systems, focusing only on QM applications.
The work described in [52] uses MC sampling (based on Variational Monte Carlo) and targets
QM/MM systems, by exploiting computational clusters composed of heterogeneous nodes. Accord-
ingly, since their performance bottleneck is on the calculation of the electrostatic potential, they use
GPUs to handle the bottleneck code and CPUs for the remaining procedures, obtaining a speed-up of
up to 23.6× versus a single-core CPU. The adopted GPGPU framework is CUDA, and an MPI solution
is shown to scale up to 4 CPU cores. They do not report any explicit load balancing solution, nor do they
target the simultaneous exploitation of heterogeneous GPU platforms, contrary to the work herein presented.
In [3], the authors describe a CUDA GPGPU implementation for many-particle simulations using MC
sampling. They partition the particle set into several cells and apply many MC steps in parallel, which
are known not to interfere with each other. They do not target QM/MM systems. Instead, tests are
performed for a "hard disk" system (two-dimensional particles which cannot overlap), and the considered
particle interactions are the physical collisions. Unlike physical collisions, the electrostatic potentials
considered in our work have a much longer range, and as such the energy terms computed at each MC
step depend on a much larger number of neighbouring molecules (the potential cutoffs are about half of
the simulation box). Therefore, such a scheme would not be effective to solve the problem that is herein
considered, as most MC steps would interfere with each other. The work presented in [23] also describes
a parallel approach to particle MC simulations using CUDA, without any emphasis on QM/MM systems.
Finally, the work in [38] targets QM/MM simulations, although time sampling (MD) is used instead,
and a special focus is given to accelerating the QM grid generation, achieving up to 30× speed-up. This
contrasts with what happens in the PMC, where the bottleneck is found in the QM/MM electrostatics (the
PMC Cycle), which is significantly accelerated by our implementation.
Before concluding, it is worth recalling that direct performance comparisons are difficult to make
in this field, and very few authors attempt them in the literature. Furthermore, very few have considered the
usage of heterogeneous architectures for hybrid QM/MM simulations, whilst using MC sampling. Our
solution efficiently takes advantage of the hybrid nature of QM/MM simulations and the MC state-space
exploration, unlike typical pure QM or MM approaches.
Most existing works adopted CUDA as the programming framework, being therefore constrained to Nvidia
GPUs. To circumvent this limitation, other frameworks have been developed to ease the programming
of non-conventional architectures, such as StarPU [5] and OpenCL [22]. Due to its simpler means to
orchestrate multiple devices in a heterogeneous environment and to write code that is portable between
different architectures, the latter was used in this work. Moreover, by allowing an easy extension with the
MPI framework, the proposed approach leaves open the possibility of exploiting further performance
scalability at the chain level, since the most challenging fine-grained part, the parallelization of the
PMC Cycle, has already been overcome.
3.5 Summary
In this chapter, a brief characterization of the QM/MM simulations under study, together with an
overview of the PMC method, was presented. Then, a computational complexity analysis of the PMC Cy-
cle procedures was conducted, revealing the computational bottlenecks and concluding that the Coulomb
Grid QM/MM procedure is the most computationally intensive step of the PMC Cycle. In particular, the
dominating term was shown to be the number of QM grid points (N_QM). Next, a description of the
data dependencies present in the PMC Cycle was presented, laying out the basis for the paralleliza-
tion strategy presented in the following chapters. Finally, the related work on accelerating molecular
simulation algorithms was discussed and commented on. It was concluded that, despite the vast
diversity of research in this particular field of application, this dissertation still provides novel
contributions. In particular, heterogeneous architectures have seldom been considered, and the usage
of the multi-platform, multi-paradigm OpenCL framework, as well as the targeting of the particular PMC
QM/MM method, are among the novel contributions of the work herein presented.
CHAPTER 4

PARALLEL HETEROGENEOUS SOLUTION

Contents
4.1 Original PMC QM/MM
4.2 Exploiting Markov Chain Parallelism
4.3 Parallelization Strategy
4.4 Data Structure Optimizations
4.5 Summary
The objective of this work is to accelerate the execution of the PMC QM/MM algorithm by exploiting
heterogeneous platforms composed of a multi-core CPU and one or more OpenCL accelerators (e.g.,
GPUs). In this chapter, a top-level description of the devised parallel solution is introduced. Firstly, the
original PMC QM/MM approach (developed at the Free Floater Research Group) is briefly described.
Then, an introduction to applying Markov chain theory to MC simulations is presented, focusing
on the particular case of the PMC QM/MM simulation method. After this, the overall structure of the
parallelization strategy is laid out, discussing details about the developed OpenCL Host program and
the work-flow of the complete application. Then, a coarse-level load balancing solution to schedule the
Markov chain workload among heterogeneous devices is described. Finally, a few preliminary data-
structure optimizations are discussed. A detailed description of the developed OpenCL kernels, as well
as a second load balancing algorithm for scheduling finer-grained workloads, is presented in Chapter 5.
4.1 Original PMC QM/MM
The starting point for the parallelization study developed in this dissertation was the original PMC
QM/MM algorithm implementation, provided by the Free Floater Research Group - Computational Chem-
istry and Biochemistry, Institut fur Physikalische Chemie, Georg-August-Universitat Gottingen. This orig-
inal approach was designed to run on a single-core CPU, executing two interleaving UNIX processes:
the PMC Cycle and the QM Update, which communicated via a file (hard-disk I/O) between PMC itera-
tions. The PMC Cycle was developed at the Free Floater Research Group and its complete C++ source
was made available for this work. The QM Update comprises a few FORTRAN user scripts (also
developed at the Free Floater Research Group) which call MOLPRO routines. In contrast with the other
program parts, the MOLPRO program suite is a closed-source commercial tool, which performs
extremely complex calculations in the QM Update procedure. Since the code is not available for opti-
mization and since this procedure is not the bottleneck of the original PMC QM/MM (as will be shown
in Figure 6.2), optimizing it was deemed to be out of the scope of this work. Nevertheless, a method
for executing several instances of the QM Update in parallel will be introduced in this dissertation, by
exploring multiple Markov chain parallelism, a topic discussed in the following section.
4.2 Exploiting Markov Chain Parallelism
In the context of the Metropolis MC sampling method [32], a sequence of accepted steps is called a
Markov Chain [19, 20]. For the particular case of the PMC QM/MM algorithm, a Markov Chain represents
a sequence of accepted QM/MM system configurations, which are generated by independent instances
of the PMC (PMC Cycle + QM Update). As depicted in Figure 4.1, several independent MC state-space
exploration chains may coexist, each generating an independent sampling of the conformal space of the
target QM/MM system.
The exploitation of multiple Markov chains in general purpose MC methods has been addressed in
several works [6, 45], and even in the context of a CPU-GPU environment [57]. In the next subsections,
details for the particular case of exploring Markov chain parallelism in the PMC QM/MM method are pre-
Figure 4.1: Independent MC state-space exploration chains (illustrative example for 2 chains), each generating an independent sampling of the conformational space of the target QM/MM system.
sented. To keep the devised approach as general as possible, and considering the vast diversity of
computational platforms that are commonly available today, two distinct QM/MM simulation scenarios
deserve particular attention: running fewer Markov chains than the number of available OpenCL accel-
erators, and the opposite case. In particular, the former is typically found in many-node computational
clusters, since these hardware platforms may have more computing nodes than the number of
independent Markov chains one wishes to spawn in order to achieve the desired statistical properties
of the MC sampling. To address this case, specially tailored load balancing approaches are required,
since data from the same Markov chain exploration context has to be shared between several (possibly
heterogeneous) nodes. The approach for balancing the work of a single Markov chain among several
devices is presented in Chapter 5.
4.2.1 Multiple Markov Chain Parallelism
As introduced earlier, several MC state-space instances can be sampled by running several Markov
chains in parallel, thus allowing the simultaneous execution of the respective PMC Cycles. Furthermore,
this technique also allows executing the respective QM Update processes for the several chains in par-
allel. Since the PMC Cycle is the bottleneck of the PMC QM/MM method (as will be shown in Figure 6.2)
and provides several opportunities to extract task-level and data-level parallelism (see Section 3.3), it
will be executed on OpenCL accelerators. On the other hand, since the QM Update is an intrinsically
serial procedure, it will be executed by spawning independent Markov chain instances on multiple CPU
cores. The MC state-space sampling layout corresponding to this approach is shown in Figure 4.2 (left),
together with the corresponding execution time-flow (right). Although the depicted example corresponds
to three independent chains, this number can scale with the available computational resources, as more
OpenCL accelerators and CPU cores are added to a given hardware configuration. It is important to note
that, although the PMC Cycle was the computational bottleneck in the original implementation, the high
Figure 4.2: MC state-space alongside the execution timeline for three Markov chains.
performance speed-ups attained in the acceleration of this procedure considerably reduced its execu-
tion time (more details in Chapter 6). Therefore, depending on the considered acceleration platform, the
ratio between the execution times of the QM Update (t(QMupdate)) and the PMC Cycle (t(PMCcycle))
might vary considerably. Having this in mind, and by observing Figure 4.2, one can conclude that the
maximum number of independent Markov chains that can be spawned depends on the
t(QMupdate)/t(PMCcycle) ratio in the following manner:

max_chains = #Accelerators × t(QMupdate)/t(PMCcycle) + 1   (4.1)

where the ratio t(QMupdate)/t(PMCcycle) represents the number of (accelerated) PMC Cycles required
to occupy the OpenCL accelerator while the CPU is handling the QM Update (for the sake of keeping the
example in Figure 4.2 as simple as possible, a ratio of 2 was assumed, although larger ratios are usually
observed in real datasets - see Chapter 6). Moreover, max_chains will also be limited by the number of
CPU cores available to run the QM Updates. Since the QM Update process relies heavily on disk I/O,
the performance of the Host CPU may start to degrade when a higher number of processes is spawned
(as shall be shown in Chapter 6).
The multiple Markov chain parallelism strategy presented in [57], and in some of the works dis-
cussed in Section 3.4, relies on a very high number of Markov chains to exploit parallelism in the many-
core GPU architecture. For the particular case of the approach introduced in [57], a GPU thread is
spawned to manage each Markov chain. This approach would be unfeasible for the case of the PMC
method, since each chain requires computing not only the MC step trials (in this case, the PMC Cycle
procedures), but also the intrinsically serial QM Update process. To tackle this limitation, the approach
herein presented focused instead on exploiting task and data-level parallelism in each PMC Cycle step
(as will be discussed in Chapter 5), as well as chain-level parallelism by scheduling the tasks associ-
ated with each Markov chain (PMC Cycle and QM Update) among multiple CPU cores and OpenCL
accelerators.
4.3 Parallelization Strategy

By considering the multiple Markov chain parallelism method introduced in Section 4.2.1 and the
PMC Cycle data dependency analysis presented in Section 3.3, three levels of parallelism can be ex-
tracted in the PMC QM/MM method: i) running several independent Markov chains (chain-level par-
allelism); ii) executing the PMC Cycle procedures in parallel with respect to each other (task-level par-
allelism); iii) executing the inner iterations of each procedure in parallel, for different sections of the
dataset (data-level parallelism). In this respect, Figure 4.3 depicts the exploitation of these levels of
parallelism in the PMC QM/MM method. As discussed in Section 4.2.1, the PMC Cycle will be executed
on OpenCL accelerators, whereas the QM Update will be executed by spawning independent Markov
chain instances on multiple CPU cores. To accomplish this approach, the devised parallel solution is
mainly composed of: i) a C++ Host-side CPU program (henceforth referred to as the Host-Program)
to manage the OpenCL devices and the QM Update processes; ii) a UNIX pipe interface, written in C,
to manage communications between the Host-Program and the QM Update procedures (replacing the
original file-based communication); iii) a set of OpenCL kernels to accelerate the PMC Cycle execution.
Figure 4.3: Simultaneous exploitation of chain-level, task-level and data-level parallelism in the PMC QM/MM method (illustrative example for 3 chains).
The described management approach was taken for several reasons. Firstly, a centralized Host-
Program approach was adopted: since the hardware setup targeted in this thesis is a single compute
node composed of multi-core CPUs and heterogeneous GPUs, the overhead of centralized management
is not a problem in this case. Although the presented approach could be scaled to a multi-node com-
puting environment (e.g., using MPI), this was not considered to be a priority in this dissertation, since a
single-node heterogeneous system already allows a fairly extensive study of the employed parallelization
and load balancing schemes. Secondly, the original file-based communication system was substituted by
UNIX pipes in order to: i) free the disk from I/O burden as much as possible (since the MOLPRO program
package used in the QM Update already uses the hard-drive intensively for temporary files); ii) provide
a faster communication medium (if sufficient memory is available, pipe inter-process communication
transfers are executed via main memory). For the QM Update side of the communications, a FORTRAN/C
binding was used, and all the pipe communications code was developed in C, due to an easier access
Figure 4.4: Multi-process/multi-threading structure of the designed parallel solution for the PMC method (right), alongside the original dual-process approach (left).
to system functions from this language. Finally, since it is the bottleneck of the PMC QM/MM method, particular focus was placed on accelerating the PMC Cycle procedure with OpenCL kernels. Being the main target of acceleration in this work, the PMC Cycle will be discussed in greater detail in Chapter 5. Likewise, in order to keep the description of the devised approach manageable, this chapter will focus on describing the top-level parallel approach, leaving a more detailed description of the finer-grained parallelism exploitation and load balancing to Chapter 5.
4.3.1 OpenCL Host-Side Management
Figure 4.4 presents the original dual-process PMC approach1, alongside the multi-process/multi-threading structure of the designed solution. In the latter approach, the PMC Host-Process is mainly composed of: i) a centralized thread to manage synchronization and balancing among all OpenCL devices (OCLManager, label 1); ii) a thread dedicated to managing the OpenCL command queue operations (OCLDevice, label 2) for each device; iii) a thread dedicated to each Markov chain (OCLChain, label 3), responsible for managing inter-process communication between the PMC Host-Process and the QM Update processes (label 4). To accomplish inter-thread synchronization, the mutex and condition variable primitives were used. Furthermore, inter-process synchronization and communication were accomplished via UNIX pipes, connecting each OCLChain thread to the corresponding QM Update process. This pipe mechanism was implemented to substitute the original file-based (disk I/O) communication system (Figure 4.4, left).
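This pipe-based exchange can be sketched as follows. This is a minimal, hypothetical C++ illustration using POSIX pipe()/fork(): the child process stands in for a QM Update (which, in the actual implementation, is launched via execlp() and runs MOLPRO through a FORTRAN/C binding), and the parent plays the role of an OCLChain thread sending the lattice and ∆E data and receiving the new grid.

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <string>

// Hypothetical sketch: round-trip one message through UNIX pipes between
// the Host-Program side and a forked stand-in for the QM Update process,
// replacing the original file-based (disk I/O) exchange.
std::string pipe_round_trip(const std::string& lattice_msg) {
    int to_qm[2], from_qm[2];
    if (pipe(to_qm) != 0 || pipe(from_qm) != 0) return "";
    pid_t pid = fork();
    if (pid == 0) {                        // child: stands in for the QM process
        close(to_qm[1]); close(from_qm[0]);
        char buf[256];
        ssize_t n = read(to_qm[0], buf, sizeof(buf));    // receive lattice + dE
        std::string reply = "grid:" + std::string(buf, n > 0 ? n : 0);
        write(from_qm[1], reply.data(), reply.size());   // send the new grid back
        _exit(0);
    }
    close(to_qm[0]); close(from_qm[1]);    // parent: the OCLChain thread side
    write(to_qm[1], lattice_msg.data(), lattice_msg.size());
    close(to_qm[1]);
    char buf[256];
    ssize_t n = read(from_qm[0], buf, sizeof(buf));
    close(from_qm[0]);
    waitpid(pid, nullptr, 0);
    return std::string(buf, n > 0 ? n : 0);
}
```

If sufficient memory is available, these transfers never touch the disk, which is precisely the property the pipe interface exploits.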
1Despite being the original PMC implementation, it is not used as the performance baseline in this dissertation, since it would not allow a representative assessment of the performance gains with respect to the devised solution. The performance baseline is defined in Chapter 6.
Figure 4.5 depicts the execution work-flow of the parallel PMC program, for the case of a single-
device single-process instance (in order to keep the example manageable). The program starts by
reading the input file (step 1, Figure 4.5) containing run configurations, and the input lattice and grid
structures that will serve as starting references for the MC sampling. Then, the QM process is created
(step 2) via an execlp() call, and a UNIX pipe is opened between this process and the Host-Program to
enable inter-process communication. Next, the Host-Program will query the underlying hardware
for the available OpenCL platforms (step 3) and attempt to open an OpenCL context for each of them.
This discovery process respects several user-provided heuristics, such as allowing only certain device
types (e.g., GPUs, CPUs) or setting a maximum of selected devices. Next, the OpenCL buffers are
allocated on the selected device (step 4) and the starting references for the grid and lattice transferred
to the device (step 5) via an OpenCL command queue. Then, the first PMC Cycle (comprising K_cycle steps) is executed on the OpenCL device (step 6), and the resulting lattice configuration and ∆E term
read back to the Host-Program. The latter then communicates these data to the QM Process via a UNIX
pipe (step 7), which then executes the QM Update (step 8). After this, the obtained grid configuration
is sent back to the Host-Program, which finally transfers it to the device, starting the next PMC Cycle.
This concludes one PMC iteration. The described work-flow is repeated for K_PMC iterations, and then the
saved configurations are read back and printed to an output file. Since the OpenCL device may have limited memory, the saved configurations are in fact read back to the host periodically, according to the device's maximum memory. Having the described work-flow in mind, the next subsection describes how the execution of multiple Markov chains is balanced among the available devices.
4.3.1.A Load Balancing Among Multiple Markov Chains
Since the results produced by each Markov chain are equivalent, they may be sampled for a different number of steps with respect to each other. Therefore, balancing the execution of the Markov chains across the different OpenCL devices is accomplished via a simple algorithm that works as follows:
1. Access a shared task-queue. If there are no tasks left, finish execution, skipping step 2.
2. Execute the task taken from the task-queue.
In this approach, the balancing decision is distributed across the OCLDevice threads, although a centralized task-queue is employed to keep a record of the available work-load. This algorithm does not fit perfectly in the classification scheme presented in Section 2.4, although it could be considered a task-queue distributed balancing approach.
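Under this scheme, the shared task-queue can be as simple as an atomic counter polled by the OCLDevice threads. The sketch below (hypothetical names; C++ threads stand in for the device threads, and executing a chain is stubbed out) illustrates the two-step loop described above:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Sketch (hypothetical names) of the distributed balancing scheme: each
// OCLDevice thread repeatedly takes a chain task from a shared task-queue
// until none are left. An atomic counter is the simplest centralized
// record of the remaining work-load.
std::vector<int> run_chains(int num_chains, int num_devices) {
    std::atomic<int> next_task{0};
    std::vector<int> executed_by(num_chains, -1);
    auto device_worker = [&](int device_id) {
        for (;;) {
            int task = next_task.fetch_add(1);  // 1. access the shared task-queue
            if (task >= num_chains) return;     //    no tasks left: finish
            executed_by[task] = device_id;      // 2. execute the task (stub)
        }
    };
    std::vector<std::thread> devices;
    for (int d = 0; d < num_devices; ++d) devices.emplace_back(device_worker, d);
    for (auto& t : devices) t.join();
    return executed_by;
}
```

Because faster devices return to the queue sooner, they naturally pick up more chains, which is the intent of the distributed balancing decision.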
4.4 Data Structure Optimizations
Before describing in detail the fine-grained parallelism strategy (which will be presented in Chap-
ter 5), it is worth discussing the preliminary optimizations made on the original serial code. These
optimizations were employed to ensure that the obtained acceleration results were not inflated due to
under-performance of the serial baseline (more on the baseline definition in Chapter 6). In this respect,
Figure 4.5: Program flow of the devised parallel PMC program, for the case of a single-device single-process instance (in order to keep the illustration clear). The legend for the numbered parts of this figure is presented throughout the text.
some algorithm modifications that were made in the parallel version were later ported to the serial baseline, whenever such optimizations also led to decreased serial execution time.
4.4.1 Indexing Molecules and Atoms
As introduced in Section 3, the pure MM electrostatic interactions, which are computed every PMC
Cycle step, consider the interaction between chmol and every other MM molecule stored in the lattice
(see Algorithm 2). In the original PMC implementation, the data structures employed to store the ele-
ments of the lattice were: i) 6 vectors with AMM entries, storing the parameters {x, y, z, σ, ε, q} for each
MM atom; ii) a vector with A_MM entries that returned the molecule id, given the atom index (henceforth
referred to as atom2mol). This approach caused many inefficient looping cycles, since one would have
to loop through every pair of atoms {i, j}, access atom2mol[i] and atom2mol[j], and then check if any
of these atoms belonged to chmol. As depicted in Algorithm 3, this would waste a lot of cycles just to
find the atoms that belong to chmol.
Algorithm 3 Original interaction loop: (A_MM^2 − A_MM)/2 cycles
1: for each atom i ∈ [0, A_MM − 1[ do
2:   for each atom j ∈ [i + 1, A_MM[ do
3:     if atom2mol[i] != chmol and atom2mol[j] != chmol then
4:       continue;
5:     end if
6:     compute interaction ...
7:   end for
8: end for
To address the described inefficiency, an additional data structure was introduced, to allow map-
ping a specific molecule to the list of its respective atoms (henceforth referred to as mol2list). The
usage of this new structure reduced the total number of cycles for the MM interaction computations from (A_MM^2 − A_MM)/2 to Z_MM^(chmol) × A_MM, which is a much smaller number (see Table 3.1). The resulting iteration
structure is presented in Algorithm 4. Since it enabled a faster execution of electrostatic computations,
this improvement was added to the performance baseline used in this work.
Algorithm 4 Improved interaction loop: Z_MM^(chmol) × A_MM cycles
Init: chmol_atoms = mol2list[chmol]
1: for each atom i in chmol_atoms : i ∈ [0, Z_MM^(chmol)[ do
2:   for each atom j ∈ [0, A_MM[ do
3:     compute interaction ...
4:   end for
5: end for
The structure that was later used in the parallel version was slightly adapted, as depicted in Fig-
ure 4.6. Instead of returning a list with the member atoms, this new structure (henceforth referred to as
mol2atom) returns the index of the first atom belonging to the target molecule, which can then be used to
index the lattice vectors, which contain the {x, y, z, σ, ε, q} data. This structure is more suitable for GPU
platforms, since it keeps the fast access to the atoms of a target molecule that the mol2list structure
Figure 4.6: mol2atom data structure, together with the lattice vectors. The mol2atom structure returns the index of the first atom belonging to the target molecule, which can then be used to index the lattice vectors, which contain the {x, y, z, σ, ε, q} data.
provided, while also offering the possibility of reading the MM atoms directly from the lattice vectors in a
coalesced fashion.
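Building such a mol2atom structure amounts to an exclusive prefix sum over the per-molecule atom counts, as in the hypothetical C++ sketch below (the example values match Figure 4.6: a QM molecule with 36 nuclei followed by MM molecules with 3, 4 and 3 atoms):

```cpp
#include <cassert>
#include <vector>

// Sketch of building mol2atom: an exclusive prefix sum over the per-molecule
// atom counts. Molecule m then owns the contiguous atom index range
// [mol2atom[m], mol2atom[m] + atoms_per_mol[m][ in the flat lattice vectors.
std::vector<int> build_mol2atom(const std::vector<int>& atoms_per_mol) {
    std::vector<int> mol2atom(atoms_per_mol.size());
    int first = 0;
    for (std::size_t m = 0; m < atoms_per_mol.size(); ++m) {
        mol2atom[m] = first;         // index of the first atom of molecule m
        first += atoms_per_mol[m];
    }
    return mol2atom;
}
```

Because the atoms of each molecule occupy one contiguous range, a work-group reading them from the lattice vectors naturally touches consecutive addresses.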
4.4.2 Computing Distances
As introduced in Section 3, the Coulomb MM and VDW MM procedures include the computation of
the Cartesian distances between the chmol and every other MM molecule, in both the new (after the MC
step) and the old system configurations (see Algorithm 2). To save computing operations, the original
PMC implementation maintained two distance buffers, one to store the distances between all the atoms
in the reference system (old-dists), and another to store the distances in the configuration currently being
tested (new-dists). Both these buffers were implemented as a symmetric matrix with A_MM^2 entries, such
that the entry {i, j} stores the same value as the entry {j, i}. By using this mechanism (depicted in
Figure 4.7, left), only the new-dists buffer had to be updated after a new MC step, since the distances
in the reference system were already stored in the old-dists buffer. Hence, only half of the distance
operations have to be executed, resulting in a total of Z_MM^(chmol) × A_MM calculations. However, to maintain
these buffers, additional memory operations had to be performed at the decision step: a) if the MC step
is accepted, the old-dists buffer has to be updated with the new distances computed for the chmol, b)
on the other hand, if the step is rejected, the new-dists buffer needs to be restored to its original state,
since it will have to be used again in the next step. Either of these options will result in 2 × Z_MM^(chmol) × A_MM memory operations, since one has to restore every new/old-dists[m][n] entry for m = chmol, n ∈ [0, A_MM[ and for m ∈ [0, A_MM[, n = chmol. Considering the specific case of a GPU platform and a
new/old-dists buffer implemented as a 1D vector with A_MM^2 entries, the first memory operations would
result in AMM coalesced memory writes, whereas the second memory operations would result in AMM
non-coalesced memory writes. The latter might introduce significant overhead on a GPU platform, which, when also considering the quadratic memory requirement of these buffers (2 × A_MM^2), indicates that the described approach to distance computation is not suitable for GPU platforms.
In order to address this problem, the alternative approach presented in Figure 4.7 (right) was devised.
Figure 4.7: Original approach to distance computation (left), together with the devised on-the-fly solution (right). For the sake of clarity, the distance computation procedures were singled out, although they are executed in the same computation loop as the Coulomb/VDW procedures; the remaining procedures of the PMC Cycle step have also been omitted.
This solution exploits the huge number of compute units available in typical many-core GPU platforms to
compute all the necessary distance operations in every iteration (on-the-fly), totaling 2 × Z_MM^(chmol) × A_MM
calculations. By computing these additional terms, the distance buffers ceased to be required, avoiding
both the quadratic memory requirement and the overhead of updating the distance buffers. Furthermore,
since those buffers are required to be persistent between MC iterations, using them in GPU platforms
would require reading and writing them from global memory2, whereas for the case of the on-the-fly
version, the distance values are generated and consumed in the local scope of the Coulomb/VDW MM
procedures, which results in trading many main memory operations for register operations. Moreover,
the on-the-fly version also proved to be more efficient on the CPU platform used as the baseline for this work (more details in Chapter 6), and was thus also included there.
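As an illustration, the per-pair work of the on-the-fly scheme reduces to a register-only computation such as the sketch below, assuming a cubic periodic box and the minimum-image convention (the text mentions that distances respect a periodic box; the helper name is hypothetical):

```cpp
#include <cassert>
#include <cmath>

// Register-only sketch of one on-the-fly distance evaluation: the squared
// minimum-image distance in a cubic periodic box of side `box`, computed
// (and immediately consumed) inside the Coulomb/VDW MM loop, so that no
// new-dists/old-dists buffers need to be kept or restored.
double min_image_dist2(double dx, double dy, double dz, double box) {
    dx -= box * std::round(dx / box);   // wrap each component into [-box/2, box/2]
    dy -= box * std::round(dy / box);
    dz -= box * std::round(dz / box);
    return dx * dx + dy * dy + dz * dz; // squared distance: no sqrt required
}
```

Both the old and the new chmol positions can be fed through this computation within the same loop iteration, which is where the factor of two in the distance-operation count comes from.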
4.5 Summary
In this Chapter, a top-level description of the devised parallel solution was introduced. Firstly, the
original PMC QM/MM approach (developed at the Free Floater Research Group) was briefly described,
and the main pitfalls present in the original approach were commented on. Then, an introduction on applying Markov chain theory to the particular case of the PMC QM/MM simulation method was presented.
After this, the overall structure of the parallelization strategy was laid out, discussing details about the
structure of the developed OpenCL Host-Program and the work-flow of the complete application. Then,
a coarse-level load balancing solution to schedule the Markov-Chain workload among heterogeneous
2Considering the GPU platforms used in this dissertation.
devices was described. Finally, a few preliminary data-structure optimizations were discussed. A de-
tailed description of the developed OpenCL Kernels, as well as a second load balancing algorithm for
scheduling finer-grained workloads, will be presented in Chapter 5.
CHAPTER 5
FINE-GRAINED PARALLELISM ANDMULTI-DEVICE LOAD BALANCING
Contents
5.1 PMC Cycle Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Exploiting Single Markov Chain Parallelism . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
In this Chapter, a description of the devised parallel solution for extracting fine-grained parallelism
in the PMC Cycle is presented. In this respect, the OpenCL kernels developed for accelerating the PMC Cycle procedures will be introduced and described. Next, a multi-device approach for executing the workload belonging to a single Markov chain is introduced, and the synchronization and
communication overheads are discussed. After this, a fine-grained dynamic load balancing solution is
presented.
5.1 PMC Cycle Parallelization
The OpenCL kernels that compose the PMC Cycle are listed and mapped to the corresponding procedures in Figure 5.1. To minimize communication, the PMC Cycle procedures that share the same
input data (see Figure 3.4) were merged into the same kernel. Furthermore, the OpenCL version re-
quires additional kernels to finish the implemented parallel reductions. In order to keep track of the
kernel dependencies with respect to each other, OpenCL events were used to chain the kernel calls.
These kernels and the employed strategy for their parallelization will be discussed in more detail in the
next subsections.
Figure 5.1: Mapping of the PMC Cycle procedures into OpenCL kernels. It should be noticed that some procedures were merged into the same kernel. Furthermore, the OpenCL version requires additional kernels for the parallel reductions (mm_finish and q3m_finish, marked with a ∗).
Firstly, before entering into the development details of each OpenCL kernel, the memory layout strategy will be discussed. Figure 5.2 presents the memory layout for the main data structures used in the PMC Cycle. The Host-Program will try to fit as much constant data in constant memory as possible, although
this memory is usually much more limited than global memory, and for most devices this will mean having
to place constant buffers in global memory. The layout depicted in Figure 5.2 is a possible instance of
such buffer distribution. As introduced in Section 3, the lattice is composed of 3 constant vectors q, σ, ε
and 3 non-constant vectors x, y, z, which are altered when the reference is updated in the decision
step. The first entries of these vectors hold the variables for the QM atomic nuclei (label 1, Figure 5.2),
whereas the remaining entries hold the MM atom data. Furthermore, the mol2atom structure (see Figure 4.6) may
also be stored in constant memory (label 2). The grid buffers are also constant vectors by nature, since
Figure 5.2: Memory layout example for the main data structures used in the PMC Cycle.
they are not altered during the kernels execution (only by the QM process). However, the size of these
buffers is typically prohibitive (up to 320MB, according to Table 3.1) in respect to the available constant
memory of typical OpenCL devices, forcing the Host-Program to allocate these buffers in global memory
(label 3). All these data buffers were chosen to be represented as one dimensional vectors to allow a
contiguous placement in main memory, and to reduce the level of access indirection (i.e., use a single
pointer) as much as possible. As discussed further on, this favors coalesced memory accesses.
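The layout can be summarized by a structure-of-arrays sketch (a hypothetical C++ mirror of the device buffers; the actual buffer types and sizes are device-dependent):

```cpp
#include <cassert>
#include <vector>

// Structure-of-arrays sketch of the lattice buffers: one contiguous 1D
// vector per per-atom variable {x, y, z, q, sigma, epsilon}, so consecutive
// work-items reading consecutive atoms touch consecutive addresses
// (a coalesced pattern) through a single level of indirection.
struct Lattice {
    std::vector<double> x, y, z, q, sigma, epsilon;
    explicit Lattice(std::size_t n_atoms)
        : x(n_atoms), y(n_atoms), z(n_atoms),
          q(n_atoms), sigma(n_atoms), epsilon(n_atoms) {}
    std::size_t size() const { return x.size(); }
};
```

The alternative array-of-structures layout would interleave the six variables of each atom, breaking the contiguity that coalesced accesses rely on.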
5.1.1 Monte Carlo
The Monte Carlo procedure has little potential parallelism to be extracted, since it is mainly composed of a fairly light and intrinsically serial operation: the random translation and rotation of a single random
molecule (chmol). Nevertheless, an OpenCL kernel (monte_carlo) was developed to execute this task
in the OpenCL device, because this kernel manipulates data that will be used by the kernels that follow,
enabling the communication of these data via the device’s global memory, without needing to have the
Host-Program as an intermediary. Furthermore, since a random number generator would introduce
unnecessary overhead in the GPU1, the random numbers required for the MC perturbation (10 vectors)
are pre-generated in the Host-Program and sent to the OpenCL device. The size of these vectors will
depend on the device's memory capabilities, and it is the Host-Program's responsibility to manage the periodic refresh of these random lists, every N_auto steps.
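A host-side sketch of this pre-generation step is shown below (hypothetical C++; the generator and distribution are illustrative choices, not necessarily those of the actual implementation, which draws separate batches for the molecule id, rotation and translation parameters):

```cpp
#include <cassert>
#include <random>
#include <vector>

// Host-side sketch: draw N_auto values for each of the random-parameter
// vectors up front, to be transferred to the device in one batch and
// refreshed every N_auto steps, avoiding per-step generator state traffic
// in device global memory.
std::vector<std::vector<double>>
pregenerate_random(std::mt19937& rng, int n_vectors, int n_auto) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<std::vector<double>> vecs(n_vectors);
    for (auto& v : vecs) {
        v.resize(n_auto);
        for (auto& r : v) r = uni(rng);  // one batch per random parameter
    }
    return vecs;  // the Host-Program ships these buffers to the device
}
```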
As depicted in Figure 5.3, the monte_carlo kernel starts by loading the necessary random parameters
from memory, then applies the perturbation to the randomly selected molecule (specified in vector rIDs
and loaded from the lattice structure), and finally writes the displaced molecule (chmol) to global mem-
ory. This structure carries the new {x, y, z} values for each atom, the molecule id (ID), and the number
of atoms of the chmol (Z(chmol)MM ). The chmol will be loaded from memory by the energy computation
1A simple pseudo-random generator, such as the one provided by glibc, would require at least two global memory accesses for each random number vector: one to load the current generator sequence state and another to update this same state.
Figure 5.3: Diagram for the devised monte_carlo kernel, together with the layout of the data which is manipulated in this procedure.
kernels, which are described in the next sub-sections.
5.1.2 Coulomb Grid QM/MM
The amount of parallelism that can be extracted within each kernel varies according to existing data
dependencies and the amount of input data. Accordingly, it is highest in the q3m_c kernel, not only
because the Coulomb QM/MM energy interaction (Algorithm 1) is highly data-parallel, but also due to
the size of the grid it takes as input, which may vary from hundreds of thousands to millions of grid
points. The data partition scheme employed in the q3m_c kernel consists of tasking each work-item
with computing the interaction between P grid points and the atoms belonging to chmol. While the
latter is the same for every work-item and might be loaded as a global memory broadcast, the former
consists of different load addresses for each work-item. In order to obtain coalesced memory accesses,
the grid partition shown in Figure 5.4 was employed. As depicted, the grid data is stored in four one-
dimensional independent vectors, one vector for each coordinate {x, y, z} and another vector for the
charge q. Each work-group performs P memory loads, where each work-item gets the vector addresses
localindex + wgsize × i (for i iterating from 0 to P − 1). Hence, by using this strategy, work-group grid
point loads always fetch contiguous addresses, thus achieving a coalesced memory access. It should
be noticed that although Figure 5.4 depicts an example for wgsize = 4 and P = 2, this is merely for
illustration purposes, as the optimal parameter choice is different according to the target OpenCL Device
Figure 5.4: Scheme used for partitioning the grid among the work-groups, in order to allow a coalesced memory access pattern. For the sake of keeping the illustration clear, an example for P = 2 and wgsize = 4 is shown.
(see Chapter 6 for further details).
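The described addressing can be captured in one expression. The hypothetical helper below reproduces the P = 2, wgsize = 4 example of Figure 5.4, adding the work-group's stripe offset to the localindex + wgsize × i rule stated above:

```cpp
#include <cassert>

// Hypothetical helper reproducing the described addressing: work-item
// `local_index` of work-group `wg` performs its i-th load (i = 0..P-1) at
// this grid-vector address, so each group-wide load fetches one contiguous
// stripe of wgsize addresses (a coalesced access).
int grid_address(int wg, int wgsize, int P, int local_index, int i) {
    return wg * P * wgsize + local_index + wgsize * i;
}
```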
A diagram of the q3m_c and q3m_finish kernels is presented in Figure 5.5. Firstly, each work-item loads from global memory P grid points and the atoms that comprise the chmol molecule. The latter corresponds to a global memory broadcast, whereas the former is performed by P coalesced global
memory read instructions, each reading a contiguous stripe of grid points to the work-group (step 1, in
Figure 5.5). Then, for each {atom, grid point} pair, the corresponding work-item computes the squared
Cartesian distance (according to a periodic box) and compares it with squared cutoffs, thus avoiding an
expensive sqrt operation. Depending on the resulting distance, the corresponding energy expression is
computed (see the cutoff branches in Algorithm 1) and the results are accumulated in private memory
(step 2). After this, work-items of the same work-group reduce the computed energies using local mem-
ory, by accumulating all terms in one memory address after log2(work-group size) iterations (label 3).
Then, the first work-item of each work-group writes the obtained partial result into global memory and a
final reduction kernel with only one work-group is launched (label 4), to reduce the remaining terms into a single value (since different work-groups cannot synchronize with each other during a kernel execution). Hence, by including a
first set of energy reductions in the same kernel as the ∆E^{C,grid}_{QM/MM} energy computation (q3m_c), expensive global memory transfers that would otherwise be required between kernel launches are avoided.
Furthermore, all reductions are organized in order to favour warp/wavefront release, ensuring that half
of the active work-items finish their execution soon after each reduction iteration, thus promoting higher
GPU occupation. The corresponding reduction structure is presented in Algorithm 5.
Figure 5.5: q3m_c and q3m_finish kernels structure. In this example, work-group 0 was presented with additional detail, although all work-groups share an identical structure. Likewise, the 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable. Furthermore, additional details concerning the first global memory accesses (label 1) are depicted in Figure 5.4.
5.1.3 Coulomb/VDW MM
The mm_vdwc kernel has a similar structure to q3m_c, except that it accounts for the interaction between the changed molecule (chmol) and the lattice, instead of the grid. In this kernel, the Coulomb and
the vdW interactions have been merged together, allowing the sharing of the result from the distance
computation of the same {atom, atom} pair via private memory (registers) in the same work-item.
The reduction structure is the same as the one presented in Figure 5.5. The data structure optimiza-
tions discussed in Section 4.4 were herein employed, to avoid having to maintain a buffer to store the
distances. The parallelization structure is identical to the one presented for Coulomb Grid QM/MM (see
Section 5.1.2), apart from the involved data.
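The benefit of the merge can be illustrated with a per-pair sketch in which a single squared distance feeds both energy terms (standard Coulomb and 12-6 Lennard-Jones forms are assumed here; prefactors, units and cutoff handling are omitted, and the helper name is hypothetical):

```cpp
#include <cassert>
#include <cmath>

// Per-pair sketch of the merged kernel: one squared distance, held in a
// register, feeds both the Coulomb and the 12-6 Lennard-Jones (vdW) terms,
// so the distance is computed once instead of twice.
void pair_energies(double r2, double qi, double qj,
                   double sigma, double epsilon,
                   double& e_coulomb, double& e_vdw) {
    e_coulomb = qi * qj / std::sqrt(r2);          // Coulomb term needs r
    double s6 = std::pow(sigma * sigma / r2, 3);  // (sigma/r)^6 directly from r2
    e_vdw = 4.0 * epsilon * (s6 * s6 - s6);       // Lennard-Jones 12-6 term
}
```

Note that the vdW term never needs the square root: it is built from r² alone, which is one reason squared distances are carried through the kernels.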
5.1.4 Coulomb Nuclei/VDW QM/MM
Unlike the other energy computation kernels, the q3m_vdwc kernel involves a much lower amount of input
data. By recalling from Section 3.2 that the complexity of the QM/MM Coulomb Nuclei and QM/MM
VDW procedures is Z_MM^(chmol) × Z_QM, and by consulting Table 3.1, one can conclude that the amount
of loop iterations of these procedures falls somewhere in the order of magnitude of 10^2. Hence, if an approach similar to the other energy computation kernels is followed, this means that the maximum amount of work-items one can spawn will also fall in the order of magnitude of 10^2. For this reason, the
reduction structure for q3m_vdwc is simpler, as shown in Figure 5.6. In this kernel, only two work-groups
Algorithm 5 Pseudo-code for the energy reduction.
Init: local_size = size of this work-group
Init: local[local_id] = private ∆E^{C,grid}_{QM/MM} energy
1: for offset = local_size/2; offset > 0; offset >>= 1 do
2:   if local_id < offset then
3:     local[local_id] = local[local_id + offset] + local[local_id];
4:   end if
5:   Local barrier: wait for the work-group.
6: end for
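A sequential C++ simulation of this tree reduction is given below; on the device, each inner loop is executed concurrently by the active work-items and the iterations are separated by local barriers:

```cpp
#include <cassert>
#include <vector>

// Sequential simulation of the local-memory tree reduction of Algorithm 5:
// after log2(size) halving iterations, element 0 holds the sum of all the
// work-group's private energies (the size is assumed a power of two).
double tree_reduce(std::vector<double> local) {
    for (std::size_t offset = local.size() / 2; offset > 0; offset >>= 1)
        for (std::size_t id = 0; id < offset; ++id)  // the active work-items
            local[id] += local[id + offset];
    return local[0];
}
```

The halving of `offset` is also what enables the early warp/wavefront release mentioned above: after each iteration, the upper half of the previously active work-items has no further work.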
Figure 5.6: q3m_vdwc kernel structure. An 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable.
are launched, one for the Coulomb Nuclei QM/MM procedure and another one for QM/MM VDW. The
former is tasked with computing and summing the EC Coulomb terms (label 1, Figure 5.6) for each QM
atom (zi) and chmol atom (aj), whereas the later is responsible for computing the Evdw terms (label
2), for the same atom pairs. Finally, when each work-group finishes execution, the accumulated energy
terms are written in global memory (label 3), to be subsequently read by the decide_update kernel.
In contrast with the q3m_c/q3m_finish kernels, a subsequent reduction kernel is not required, since each energy term is reduced completely within one work-group. Depending on the target OpenCL device, the
chosen work-group size might vary, and the data each work-item computes varies accordingly.
5.1.5 Decision Step
After all the energy computation kernels have terminated their execution, the decide_update kernel
is launched. Figure 5.7 depicts the work-flow of this kernel. First, the accumulated results from the
previous kernels are read from global memory and added together (label 1, Figure 5.7). Then, the
step is accepted if the energy of the obtained configuration is lower than that of the previous configuration reference, or accepted with probability e^{−∆E/(k_B T)} if the energy of the system has risen. Otherwise, the step
is rejected. For the case of accepted steps, the chmol configuration under test is copied to the current
lattice reference (label 2). Regardless of this decision, the current system configuration is saved (label
Figure 5.7: decide_update kernel diagram. An 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable.
4) to global memory every F_output steps (see Table 3.1). This step-saving operation takes 3 × (A_MM + Z_QM)/wgsize cycles, and follows a coalesced memory write pattern. The kernel finishes execution either after this
memory operation (label 4), or immediately after the step has been decided (label 3). Furthermore, since
the saved configurations will occupy a fair amount of memory in the OpenCL device (each configuration
taking 3 × (A_MM + Z_QM) numbers), the host is responsible for periodically reading these buffers back to main memory and writing them to an output file.
Since the typical range for the parameter Foutput is fairly high (see Table 3.1), the employed parallelization
scheme in the step saving does not have much impact. Nevertheless, it was implemented for
the sole purpose of enabling faster debug runs, where one might want to print every step (Foutput = 1) to
observe how the QM/MM system evolves at a higher granularity. This is an important feature for code
maintainability.
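The acceptance rule of the decision step can be sketched as follows. This is a minimal host-side illustration, not the OpenCL kernel itself; the Boltzmann constant value (here in Hartree/K) and the function name are illustrative assumptions.

```python
import math
import random

KB = 3.1668114e-6  # Boltzmann constant in Hartree/K (illustrative units)

def decide_step(delta_e, temperature, rng=random.random):
    """Metropolis criterion: accept if the energy decreased, otherwise
    accept with probability exp(-dE / (KB*T))."""
    if delta_e <= 0.0:
        return True
    return rng() < math.exp(-delta_e / (KB * temperature))
```

Downhill moves are always accepted, while uphill moves survive with a probability that decays exponentially with the energy increase.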
5.2 Exploiting Single Markov Chain Parallelism
As introduced in Chapter 4, each Markov chain represents one Monte Carlo state-space exploration
instance. The particular case of having a single Markov chain corresponds to the work-flow which was
depicted in Figure 4.5. In this approach, the QM Update and the PMC Cycle depend strictly on the
previous PMC iteration, so only one of these procedures can be executed at a given time (as depicted
in Figure 4.5). Nevertheless, a higher amount of parallelism can still be extracted by running the PMC
Cycle instance belonging to the same Markov chain on multiple OpenCL devices. The workload
Figure 5.8: Exploiting multiple heterogeneous OpenCL devices to execute the PMC Cycle. The execution is balanced by executing different kernels on each device and dividing the work of the heavier kernels (q3m c and q3m reduce).
distribution of a single Markov chain among multiple OpenCL devices is discussed in the next sections.
5.2.1 Multiple OpenCL Devices
As discussed in Section 3.2, the most computationally intensive part of each PMC Cycle step
corresponds to the computation of the ∆E^(C,grid)_(QM/MM) energy term. In the presented OpenCL approach, this
energy calculation is handled by the q3m c and q3m finish kernels (see Section 5.1.2). Furthermore,
according to the dependency chart depicted in Figure 3.4, the procedure which these kernels execute
(Coulomb Grid QM/MM) only depends on the chmol data structure from the MC step and on the grid
data (which is written once to the OpenCL device at the start of each PMC Cycle execution). Moreover,
typical grids have hundreds of thousands to millions of points, allowing for a fine-grained partition among
devices. Accordingly, all those conditions make these two kernels excellent candidates for multi-device
acceleration.
Figure 5.8 illustrates the employed multi-device parallelization approach for a generic heterogeneous
system composed of a host CPU and N different OpenCL devices. In this approach, the Host is
responsible for synchronizing operations between OpenCL devices, which share partial energy results on every
iteration. In this particular example, device 0 is running all kernels, although q3m c and q3m reduce
only compute part of ∆E^(C,grid)_(QM/MM). Devices 1 to N, which might be accelerators with different compute
capabilities, calculate the remaining terms of ∆E^(C,grid)_(QM/MM). The relative performance of the accelerators (with
respect to each other) will determine the fraction of the grid each one gets (G0% to GN%) and to which
devices the least complex energy computation kernels are scheduled.
In order to keep synchronization overhead to a minimum, every device computes the MC and decision
kernels redundantly, although only one of the devices is responsible for saving the sampled
configuration, since this is the heaviest part of this procedure (see Section 5.1.5). The overhead
associated with the device synchronization, to be executed at every step, is caused by several factors.
Firstly, to read and write the partial energies of each device, one has to call the OpenCL functions
enqueueReadBuffer and enqueueWriteBuffer, which also include an implicit clFinish to wait for the pre-
vious kernels in that step to finish (launches are chained using OpenCL Events). This is accounted for
in the R/W Launches block, in Figure 5.8. Secondly, each memory transfer introduces a small overhead
corresponding to a copy of one floating-point number per reduced energy term. The number of com-
municated terms ranges from 1 to 2 terms per device, according to the employed partitioning, since it
depends on which device is computing the lighter kernels. Finally, syncing the Host-side threads that
are managing the OpenCL accelerators (Barrier Sync) and launching and parametrizing the OpenCL
Kernels (Launch overhead) also introduces some overhead.
The multi-device synchronization overheads discussed earlier do not scale with the problem size,
depending only on the number of devices that the Host-Program has to manage. Although the Host-
Program will allocate a dedicated thread for each device (see Section 4.3.1), they will compete for the
Host resources, and the effective Host-thread parallelism may degrade. Therefore, these overheads
have a complexity of O(Ndevices), although for a small number of devices with respect to the maximum
number of parallel threads that the Host CPU can run, these complexities will in practice be sub-linear
in Ndevices. Table 5.1 presents the complexity of the discussed overheads, together with two
other overheads: random list refreshing (see Section 5.1.1) and output flushing (see Section 5.1.5). The
former depends on the random list refresh frequency (10 arrays with Nauto entries, every Nauto steps),
whereas the latter depends on the number of saved QM/MM systems that the OpenCL device can hold
in its global memory (Nsystems), since the host has to read back these systems before the available
memory runs out (every Nsystems steps). Furthermore, each saved system configuration has 3 arrays of
size AMM + ZQM (see Section 5.1.5), which results in the final expression presented in Table 5.1. As for
the dependence on the number of OpenCL devices, the same rationale developed earlier applies.
Table 5.1: Complexity of communication and synchronization overheads, with respect to the QM/MM system characteristics and to run parameters.
Overhead | Overhead Complexity per PMC Cycle Step
Launch Overhead | O(Ndevices)
R/W Launches | O(Ndevices)
Read partial ∆E | O(Ndevices)
Write partial ∆E | O(Ndevices)
Refresh Random Lists | O(Ndevices × Nauto/Nauto)
Flush Output | O(Ndevices × (AMM + ZQM) × Nsystems/Nsystems)
5.2.2 Dynamic Load Balancing
To account for the possible heterogeneity of the computational platform, the amount of grid data that
is assigned to each device on each iteration is chosen according to a dynamic load balancing algorithm.
In this respect, considering the classification scheme for load balancing algorithms presented in
Section 2.4, the algorithm herein described is a centralized predicting-the-future dynamic load balancing
approach. Accordingly, Figure 5.9 depicts the work-flow of this solution, which starts from an unbalanced
load distribution and eventually converges to a balanced work-load distribution after J iterations.
Furthermore, the balancer continues to monitor the performance of the computing nodes, to ensure that
the work-load distribution remains optimal. This solution was based on one of the algorithms
presented in [12], for the case of constant data balancing problems.
Figure 5.9: Work-flow of the centralized predicting-the-future dynamic load balancing solution employed in this dissertation.
In order to apply this approach for the particular case of the q3m c/finish kernels, the 3D grid is
divided into n small and independent grid blocks. In the first step, all p devices are assigned the same
number of blocks d^0_i = n/p. Then, every r steps, this distribution is updated. Thus, at
step k, device 1 computes the grid blocks b_1, ..., b_{d^k_1}, device 2 computes blocks b_{d^k_1 + 1}, ..., b_{d^k_1 + d^k_2}, and
so on. All devices have access to all grid points, so that data displacement is not required.
Let t_i(d^k_i) be the time taken by device i to compute the assigned d^k_i blocks (plus the remaining
kernels it has been assigned) in iteration k. The implemented load balancer works as follows:
1. If max over all device pairs (i, j) of |(t_i(d^k_i) − t_j(d^k_j)) / t_i(d^k_i)| < ε, the load is balanced. Skip (2).
2. Recompute the amount of assigned grid blocks: d^{k+1}_i = n × (d^k_i / t_i(d^k_i)) / Σ^p_j (d^k_j / t_j(d^k_j))
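The two balancing steps above can be sketched on the host side as follows; the function name and the rounding fix-up at the end are illustrative assumptions, not the dissertation's actual implementation.

```python
def rebalance(d, t, n, eps=0.05):
    """One load-balancing decision, following steps 1-2 above.

    d[i]: grid blocks currently assigned to device i
    t[i]: measured time device i took for its d[i] blocks
    n:    total number of grid blocks
    Returns the (possibly unchanged) new distribution.
    """
    # Step 1: if the worst pairwise relative imbalance is below the
    # tolerance, keep the current distribution.
    imbalance = max(abs(ti - tj) / ti for ti in t for tj in t)
    if imbalance < eps:
        return list(d)
    # Step 2: redistribute the n blocks proportionally to measured speed
    # (blocks per unit time).
    speeds = [d[i] / t[i] for i in range(len(d))]
    total_speed = sum(speeds)
    new_d = [round(n * s / total_speed) for s in speeds]
    # Fix rounding so the assignment still covers exactly n blocks.
    new_d[-1] += n - sum(new_d)
    return new_d
```

Starting from an even split, a device that is three times slower quickly converges to roughly a quarter of the blocks.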
At this point, it is important to recall (see Section 3.1) that different cutoff regions may result in differ-
ent energy expressions (or no energy computation at all). Consequently, the accelerators might com-
pute over grid partitions that fall into different cutoff regions, thus having different computational efforts.
Hence, execution time measurements should be averaged over several previous steps, in order to
avoid misclassifying device performance. The described algorithm
is run by the Host once every Fbal steps, after the Barrier Sync (see Figure 5.8). Since the computational
cost of a given step depends on the chmol position in space (which defines the cutoff region
center), Fbal should be of the same order of magnitude as NMM (10^3, see Table 3.1), to ensure that on average
every molecule has been moved once between balancing steps. Furthermore, in order to be able to
make the balancing decision, the performance measurements of each device are shared between the
corresponding Host-threads, via Host shared-memory.
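Averaging the per-step measurements can be done with a simple rolling window on the host side; this sketch (the class name and window default are illustrative) smooths out the step-to-step variation caused by the different cutoff regions.

```python
from collections import deque

class DeviceTimer:
    """Rolling mean of the last `window` per-step execution times,
    used to estimate a device's performance robustly."""
    def __init__(self, window=2000):
        self.samples = deque(maxlen=window)  # old samples fall off the front

    def record(self, step_time):
        self.samples.append(step_time)

    def mean(self):
        return sum(self.samples) / len(self.samples)
```
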
5.2.2.A Problem Partitioning Approaches
In the literature on accelerating MD simulations, additional data structures which allow a fast verification
of interaction cutoffs are sometimes employed. A typical approach consists in using a neighbor-list
method [4, 56]. A neighbor-list is a data structure (e.g., a matrix or a vector of linked lists) that records,
for each particle, the particles that are close enough to it, according to a cutoff. This spares unnecessary
computation over molecules which are already known to be outside the interaction cutoff. Another
possible approach is the cell-list method [49], which employs a similar data structure, although
the space is partitioned into geometrical cells, instead of the more relaxed (in the geometrical sense)
neighbor-list approach. Although these works target MD simulations, a similar approach could be un-
dertaken for the PMC QM/MM (which is an MC method) to partition the QM grid points. However, since
the grid has to be updated every PMC (outer) iteration (by the QM Update), it is not obvious if a cell-list or
neighbor-list approach would bring additional performance gains, since the overhead of creating these
lists would have to be repeated every PMC iteration. The study of a possible grid partitioning approach
is left for future work.
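For reference, a cell-list can be built with a single hashing pass over the particle positions; this is a generic sketch (not tied to the PMC QM/MM grid), where the function names and the uniform cubic cells are assumptions.

```python
from collections import defaultdict

def build_cell_list(positions, cell_size):
    """Hash each particle into a cubic cell of edge cell_size; cutoff
    checks then only need to visit the 27 surrounding cells."""
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        key = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        cells[key].append(idx)
    return cells

def neighbours(cells, pos, cell_size):
    """Candidate particles within one cell of pos (a superset of the
    particles inside the cutoff sphere)."""
    cx, cy, cz = (int(c // cell_size) for c in pos)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                out.extend(cells.get((cx + dx, cy + dy, cz + dz), ()))
    return out
```

The candidate set returned by neighbours() still has to be filtered with the exact cutoff distance, but the vast majority of far-away particles is never visited.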
5.3 Summary
In this Chapter, a description of the devised parallel solution for extracting fine-grained parallelism
in the PMC Cycle was presented. In this respect, the developed OpenCL kernels were introduced and
described, and the devised OpenCL memory layout structure was presented. Since the heaviest kernels
correspond to q3m c and q3m finish, these procedures were described in greater detail. Then, a multi-device
approach for executing the workload belonging to a single Markov
chain was introduced, and the synchronization and communication overheads discussed. Finally, a fine-
grained dynamic load balancing solution to efficiently take advantage of heterogeneous OpenCL devices
was presented.
CHAPTER 6
EXPERIMENTAL EVALUATION
Contents
6.1 Benchmarking Setup . . . . . . . . . . 53
6.2 PMC Cycle Acceleration . . . . . . . . . . 55
6.3 Global PMC Results . . . . . . . . . . 60
6.4 Numerical Evaluation: Convergence Accuracy and Energy Consumption . . . . . . . . . . 63
6.5 Summary . . . . . . . . . . 64
Table 6.1: Considered QM/MM benchmark datasets. The chemical aspects of bench-R are presented in detail in [11].
Benchmark | MM Part (NMM) | QM Part | Grid size (NQM) | Total MC steps
bench-R | 500 H2O molecules | Chorismate | 183,356 | 24.8×10^6
bench-A | 1301 H2O molecules | 1 Arginine | 1,772,972 | 10×10^3
bench-B | 5000 H2O molecules | 1 Arginine | 1,772,972 | 10×10^3
bench-C | 5000 H2O molecules | 2 Arginines | 2,992,458 | 10×10^3
In this Chapter, a detailed performance assessment of the devised parallel heterogeneous approach
to the PMC QM/MM method is presented and discussed. Firstly, the considered performance metrics
and the employed profiling tools are described. Then, the benchmarking setting will be discussed, both
in terms of the considered chemical datasets and the target hardware platforms. Following, the perfor-
mance baselines are defined, establishing a term of comparison for the obtained acceleration results.
After this, the acceleration results obtained both in the simulation bottleneck (PMC Cycle) and in the full
simulation are presented and discussed, and the scalability of the devised solution is evaluated. Finally,
the numerical quality and energy consumption are evaluated for alternative numerical representation
schemes.
6.1 Benchmarking Setup
6.1.1 Chemical Datasets
To experimentally evaluate the proposed parallelization approach, four chemical datasets
were carefully designed by chemical experts from the Institut fur Physikalische Chemie, Georg-August-
Universitat Gottingen. The first three benchmarks, namely bench-A, bench-B and bench-C (see Table 6.1),
represent typical QM/MM setups and will mainly be used to assess the performance of the most
demanding simulation step (i.e., the PMC Cycle), by running 10k steps. The QM part of these benchmarks
consists of a set of protonated arginines that are acylated at the N-terminus and methylaminated at the
C-terminus. This amino acid was solvated in a periodic water box (MM part), containing a variable num-
ber of water molecules (depending on the considered benchmark). The grid for the electronic charge
density description was constructed by following Mura and Knowles (α = 1 and m = 3) for the radial
distribution [36] and Lebedev (lmax = 53) for the angular distribution [27]. The QM calculations used the
density functional PBE [43] and the basis set def2-SVP [54], while the MM part was described with the
OPLS-AA force field [25]. Furthermore, the latest development version of the MOLPRO [55] program
package was used in the QM calculations.
The fourth benchmark, bench-R, consists of a smaller simulation box, designed for a much longer
and realistic run. This simulation corresponds to the chorismate molecule in solution. Its conversion to
prephenate is a widely studied biochemical reaction, and the respective chemical aspects are described
in [11]. For this benchmark, the run is comprised of 24.8 million steps, with the QM update executed
every 50k steps (totalling 496 PMC outer iterations).
Table 6.2: Considered execution platforms in the experimental evaluation.
Platform | Host CPU | RAM (CPU) | OpenCL Accelerators | RAM (Accel.)
mcx0 | Intel Core i7-4770K 4-core 3.5GHz | 32GB | - | -
mcx1 | Intel Core i7-4770K 4-core 3.5GHz | 32GB | Nvidia GTX 780Ti | 3GB
mcx2 | Intel Core i7-4770K 4-core 3.5GHz | 32GB | Nvidia GTX 780Ti/660Ti | 3GB/2GB
mcx3 | Intel Core i7-3820 4-core 3.6GHz | 16GB | AMD R9 290X/Nvidia 560Ti | 3GB/1GB
mcx4 | 2x Xeon E5-2609 (4-core each) 2.4GHz | 32GB | 2x Nvidia GTX 680 | 4GB/4GB
mcx5 | Intel Core i7-3770K 4-core 3.5GHz | 8GB | Nvidia K20C | 5GB
mcx6 | Intel Core i7-4770K 4-core 3.5GHz | 32GB | i7-4770K | 32GB
6.1.2 Hardware Platforms
The considered hardware for the experimental setup is listed in Table 6.2. The considered plat-
forms correspond to several hardware configurations of the machines available at the SiPS research
group, which include Intel i7 CPUs, Nvidia GPUs and AMD GPUs. These platform configurations were
selected to allow a fairly complete evaluation of the devised parallel solution: i) mcx0 will be used as
the performance baseline (more details in Section 6.1.3), ii) mcx1 and mcx2 were selected to evaluate
the load balancing solution between two GPUs with very different compute performances (GTX 780Ti
and GTX 660Ti), iii) platform mcx3 was selected to evaluate the performance of a highly heterogeneous
system composed of GPUs from different vendors (AMD R9 290X and Nvidia 560Ti), iv) mcx5 will
mainly be used to assess energy consumption (since it supports NVML power measurements), v) mcx6
will be used to evaluate the parallel OpenCL solution when running on a multi-core CPU. In the pre-
sented platform configurations the Host-CPU will be both managing the OpenCL devices and running
the QM Updates, using all the available cores.
Different OpenCL work-group partitioning schemes were used for each device. For Nvidia GPUs,
the CUDA calculator [41] proved to be a useful tool for choosing starting point parameters. For AMD
cards and Intel CPUs, the optimal values were found through testing and experimentation, resulting in
small multiples (e.g., 1 to 4) of the preferred elementary work-group size returned by an OpenCL device
discovery query, made at runtime to the underlying platform. The newest available OpenCL standard
was used for each device (OpenCL 1.1 for the considered Nvidia GPUs and OpenCL 1.2 for the Intel
CPUs and AMD GPUs).
6.1.3 Performance Baseline
The original PMC QM/MM single-core code was reviewed and optimized, to ensure that the obtained
acceleration results were not inflated due to under-performance of the serial baseline. Accordingly, the
optimizations discussed in Section 4.4 were added to the original algorithm. Most of the performance
comparisons presented in this Chapter are relative to the optimized version of the reference code ex-
ecuted on a single core of the i7-4770K processor (platform mcx0), compiled with Intel compiler (ICC
v13.1.3) with flags -O3 -xCORE-AVX2, unless otherwise specified. This baseline will henceforth be re-
ferred to as avx2-baseline. Figure 6.1 and Figure 6.2 illustrate a profiling evaluation of the avx2-baseline,
using the bench-A input dataset. In particular, Figure 6.2 presents the overall execution results for one
PMC iteration, whereas Figure 6.1 depicts a more detailed overview of each step of the simulation bot-
Figure 6.1: Time footprint for a single PMC Cycle step for the bench-A dataset running on the avx2-baseline.
Figure 6.2: One complete PMC outer iteration, comprised of 10k PMC Cycle steps and a QM Update, for the bench-A dataset running on the avx2-baseline. The bottleneck of each PMC iteration is the PMC Cycle.
tleneck. As predicted in Section 3.2, the Coulomb Grid QM/MM procedure (∆E^(C,grid)_(QM/MM)) represents (for
all the tested input QM/MM systems) the most time consuming part of each PMC Cycle step, since
O(Coulomb Grid QM/MM) = Z^(chmol)_MM × NQM and NQM tends to be a very large number (1,772,972 for
the case of bench-A).
Furthermore, both double-precision (fp64 ) and mixed double and single-precision (fp64 -fp32 ) data-
types will be employed in the performance study made in this section. Details about these numerical
configurations and the corresponding compromises, as well as a mixed fixed-point precision approach,
will be discussed in Section 6.4.
6.2 PMC Cycle Acceleration
The main performance metric of choice is the execution time of the accelerated application. However,
to further show the benefits of the proposed parallelization approach, the application speed-up with
respect to the baseline serial execution is also adopted:

Speedup = Tbaseline / Tparallel (6.1)
where Tbaseline is the execution time of the baseline, corresponding to the serial execution
on the host CPU, compiled with the Intel compiler (ICC v13.1.3) with flags -O3 -xCORE-AVX2, so as to
enable automatic loop vectorization and the usage of AVX2 vector instructions in compliant processors
(e.g., Intel 4th generation core i7). Furthermore, Tparallel represents the execution time of the proposed
solution using the system under test. In order to measure Host-side execution times, the PAPI [35]
library is used. For evaluating kernel execution time and buffer transfers to the OpenCL devices, OpenCL
Profiling Events are used instead, since they allow a finer measurement of OpenCL device operations. In
order to identify execution bottlenecks and guide the process of algorithm acceleration, the kcachegrind
tool (based on valgrind [37]) was employed.
Table 6.3 presents the PMC Cycle execution time (10k steps) for benchmarks bench-A, bench-B and
bench-C, profiled for several hardware configurations. The overall execution time corresponds to the
cost of running 10k steps plus the final output flushing from the OpenCL device back to the host and file
writing (Output time in Table 6.3, ranging from ∼0.5s to ∼2s). The extra overheads related to the
OpenCL initialization and input file reading were not accounted for, because they do not scale with the
simulation size and would be diluted in longer runs (contrary to the output generation). Since the amount
of generated output scales with the number of executed steps, this overhead repeats itself every
10k steps (for this particular run), and is therefore taken into account in the speed-up calculations.
Table 6.3: Execution time (in seconds) for a PMC Cycle with 10k steps, on several hardware platforms, when using fp64-fp32 mixed-precision. The column "Total" corresponds to the complete execution times of the PMC Cycle (10k steps), including the final serial overhead of reading back and writing the output to a file. This overhead is discriminated in column "Output". The presented execution times correspond to a median among four experimental trials, for each platform configuration.
Platform | Accelerators | bench-A Total | bench-A Output | bench-B Total | bench-B Output | bench-C Total | bench-C Output
mcx0 | none (avx2-baseline) | 769.96 | 0.231 | 787.68 | 0.755 | 1179.70 | 0.760
mcx6 | i7-4770k | 137.90 | 0.899 | 140.25 | 6.751 | 232.20 | 6.873
mcx1 | 780Ti | 6.33 | 0.534 | 8.08 | 1.699 | 11.20 | 1.717
mcx2 | 780Ti/660Ti | 5.04 | 0.517 | 6.73 | 1.657 | 8.68 | 1.656
mcx3 | R9 290X/560Ti | 6.90 | 0.572 | 8.42 | 2.015 | 11.30 | 2.019
The difference between the execution times of bench-A and bench-B stems from the number of
MM molecules, which has two implications: bench-B imposes a heavier footprint on the mm vdwc
and mm finish kernels, and generates a larger output, which in turn means a heavier
decide kernel and a longer output flushing. The latter can be observed in Table 6.3 and mainly depends
on Host-to-Device communication speed to read back the output, and on the time to write the output file.
Consequently, it is higher in the parallel platforms, since the output has to be read back from an external
device (in respect to the OpenCL-Host).
On the other hand, bench-C has a larger QM part, resulting in heavier q3m c and q3m finish kernels.
This favors the overall performance with respect to bench-B, as the performance of the most data-parallel
kernels is favored by a higher number of grid points. The speed-up results of the parallel platforms with
respect to the avx2-baseline (corresponding to the execution times presented in Table 6.3) are depicted
in Figure 6.3.
Figure 6.3: Speed-up obtained for a PMC Cycle with 10k iterations, when using fp64-fp32 mixed-precision. The corresponding execution times are presented in Table 6.3.
According to the presented results, the speed-up values in the PMC Cycle acceleration are fairly
high when compared to the avx2-baseline. This is a direct consequence of a careful exploitation of the
memory hierarchy, together with the higher memory bandwidth of GPU architectures. In fact, although
CPUs compensate for their lower main memory bandwidth with multiple levels of high-speed caches, the most
intensive procedure in the PMC Cycle (Coulomb Grid QM/MM) requires loading a huge amount of data
from main memory at each step (e.g., up to 48MB for the case of bench-C), rendering the first cache
levels useless. Nevertheless, coalesced memory accesses still exploit parallelism when accessing the
main GPU device memory, regardless of whether local caches are used.
Table 6.4 presents kernel execution times for the particular case of the GTX780Ti accelerator, to-
gether with the times corresponding to the reference implementation in the avx2-baseline platform. As
can be observed, the kernels that achieve the highest speed-up are q3m c and q3m reduce, as pre-
dicted in Section 5.1.2. The very large speed-up attained in these kernels (160.84×) is subsequently
affected by Amdahl’s Law (considering the fractions and speed-ups of all the other kernels) and results
in an overall PMC Cycle step speed-up of 135.55×. By recalling the execution times presented in
Table 6.3 for the particular case of the GTX780Ti accelerator, the speed-up without considering the Output
overhead would be (769.96 − 0.231)/(6.33 − 0.534) ≈ 132.8× (versus the value of 121.29× presented in Figure 6.3, where
every component is taken into account), which is slightly below the speed-up attained in the PMC Cycle
step, due to device management and kernel launching overheads, not accounted for in Table 6.4.
Furthermore, two additional details are worth commenting on. Firstly, the monte carlo kernel is faster
on the GPU, since it relies on pre-generated random number lists, which are computed by the Host in
parallel and refreshed when necessary. Conversely, the baseline version computes these numbers
on-the-fly, resulting in a heavier Monte Carlo step. Secondly, the decision kernel is also faster because
the results are accumulated locally and only read back and written to a file from time to time, thus being
accounted for in the Output fraction of the profiling (see Table 6.3).
Table 6.4: Kernel execution times obtained on the GTX780Ti accelerator and on the reference avx2-baseline platform, for the particular case of bench-A. The speed-up with respect to the avx2-baseline is also presented, together with the fraction of the PMC Cycle (%) each kernel represents.

Kernel | avx2-baseline | GTX780Ti | Speed-up
monte carlo | 32us (0.04%) | 17us (3.0%) | 1.88×
q3m c/finish | 76077us (98.8%) | 473us (83.3%) | 160.84×
mm vdwc/finish | 791us (1.03%) | 40us (7.0%) | 19.77×
q3m vdwc | 4us (0.01%) | 18us (3.2%) | 0.22×
decide | 94us (0.12%) | 20us (3.5%) | 4.70×
total | 76998us (100%) | 568us (100%) | 135.55×
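The Amdahl's Law argument can be made concrete by recomputing the overall speed-up implied by the per-kernel figures in Table 6.4. This sketch uses the rounded table fractions, so it lands near, but not exactly on, the reported 135.55×.

```python
# Per-kernel baseline time fractions and measured speed-ups from Table 6.4.
kernels = {
    "monte_carlo":    (0.0004, 1.88),
    "q3m_c_finish":   (0.9880, 160.84),
    "mm_vdwc_finish": (0.0103, 19.77),
    "q3m_vdwc":       (0.0001, 0.22),
    "decide":         (0.0012, 4.70),
}

def amdahl_speedup(parts):
    """Overall speed-up = 1 / sum(fraction_i / speedup_i) over all kernels."""
    return 1.0 / sum(f / s for f, s in parts.values())

overall = amdahl_speedup(kernels)  # ~132x with these rounded fractions
```

Even though q3m c/finish alone accelerates by 160.84×, the remaining ~1.2% of serial-baseline time caps the combined speed-up well below that figure.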
6.2.1 PMC Cycle Load Balancing
Figure 6.4 presents the kernel timing results per PMC Cycle step, when considering bench-A execut-
ing on the mcx2 heterogeneous platform. The load balancing algorithm introduced in Section 5.2.2 was
used and converged to the grid partitioning depicted in this figure. Figure 6.5 illustrates the time evolution
of the workload balancing. Here, the balancing term r is set to 2000 iterations, in order to avoid under-
sampling the computational weight of the q3m/mm kernels (which depends on the randomly picked MM
molecule). The starting workload distribution of 50%/50% converges to approximately 71%/29% in only
4 balancing steps, favouring the more powerful 780Ti GPU. When this distribution is reached, one can
observe that the execution of the balanced workload in each GPU takes practically the same time, which
means that the load is balanced and that the balancing mechanism has met its purpose. It is worth re-
calling that the employed balancing solution was designed to distribute the workload of the q3m c/finish
kernels (corresponding to the Coulomb QM/MM part in Figure 6.4), although the measurements taken
into account to make the balancing decision include all the other kernels and overheads, since one
wishes to balance each PMC Cycle step as a whole. In order to illustrate how the balancing persists even after the
10k-th step, the chart represents the execution up to 20k steps. When compared with an unbalanced
run (e.g., a fixed 50%/50% workload distribution) on the same platform, the balanced version yields a
speed-up of 1.3×, further justifying the advantage of having incorporated a load balancing solution.
6.2.2 PMC Cycle Scalability
The memory footprint of the PMC Cycle kernels in the OpenCL accelerators is mainly limited by
the program output, pre-generated random lists, the MM lattice and the QM grid. The first two solely
depend on the number of executed steps, and are addressed by having the Host CPU flush the output
and refresh the random lists periodically. The latter two were also not a problem for the selected
benchmarks, since the largest used QM grid and MM lattice occupy ∼48MB and ∼160KB, respectively.
Nevertheless, it is important to note that the scalability of the proposed implementation is not
compromised even when significantly larger simulations are considered. To address such cases, the
following solution is envisaged: the q3m/mm kernels may concurrently execute over one chunk of data
while the Host CPU is transferring the next chunk. This double-buffering mechanism can be achieved
in OpenCL devices by using a second OpenCL command-queue and another Host CPU thread to issue
Figure 6.4: OpenCL kernel timings (per step) for the PMC Cycle running on the mcx2 heterogeneous platform. The load is balanced for the heavier kernels (q3m c/q3m finish, corresponding to Coulomb QM/MM), whereas the lighter kernels were scheduled to the first GPU. The considered benchmark is bench-A, using mixed fp64-fp32 precision.
Figure 6.5: Convergence pattern of the implemented load balancing algorithm (balancing every 2000 steps), for bench-C running on the GTX 780Ti/660Ti platform (mcx2). The presented PMC Cycle time measurements represent mean times since the previous balancing.
the memory transfer operations. Since all the considered QM/MM systems have a memory footprint far
below than the maximum memory available in the considered acceleration platforms, implementing this
double-buffering mechanism was not considered a priority in this dissertation.
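The envisaged double-buffering can be sketched with a helper thread standing in for the second
OpenCL command-queue: the copy of chunk i+1 is issued while the kernels consume chunk i. The
names below are illustrative stand-ins, not part of the actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def transfer(chunk):
    # Stand-in for a host-to-device copy issued on the second command-queue.
    return [float(x) for x in chunk]

def compute(device_chunk):
    # Stand-in for the q3m kernels executing over one resident chunk.
    return sum(device_chunk)

def double_buffered(chunks):
    """Overlap the copy of chunk i+1 with the computation over chunk i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(transfer, chunks[0])      # prime the first buffer
        for i in range(len(chunks)):
            resident = pending.result()                   # wait until chunk i is resident
            if i + 1 < len(chunks):
                pending = copier.submit(transfer, chunks[i + 1])  # asynchronous copy
            results.append(compute(resident))             # kernels run while the copy proceeds
    return results

print(double_buffered([[1, 2], [3, 4], [5, 6]]))  # [3.0, 7.0, 11.0]
```

On a real OpenCL device the same overlap would be obtained by enqueueing the write on the second
command-queue and making each kernel launch wait on the corresponding copy event.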
Other overheads worth discussing are the synchronization events related to scheduling the
computations belonging to a single Markov chain among multiple GPUs. These come with the
additional overhead of synchronizing the PMC Cycle step results among the involved devices at the
end of every step. Fortunately, these overheads do not scale with the simulation size, since the buffers
that need to be synchronized back and forth (between the Host and the accelerators) hold reduced
energy terms, each represented by a single number. For the particular example presented in Figure 6.4,
each device has
to read/write three reduced terms per step, which has a performance impact of a few dozen
microseconds. Conversely, the computational cost of the q3m_c/q3m_finish kernels scales with the
size of the QM grid, meaning that multi-device scalability is better for larger grids (which concern the
most computationally challenging problems).

Figure 6.6: Scalability of the PMC Cycle when changing the size of the QM part in bench-A. Speed-up
results are presented for a dual GTX680 system with respect to a single GTX680 (platform mcx4).
[Plot: the speed-up of adding a second GTX680 grows from roughly 1.56× at 4.5 million grid points to
roughly 1.64× at 8.5 million grid points.]
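The dynamic balancing whose convergence Figure 6.5 depicts can be sketched as a
throughput-proportional update rule: after each balancing interval, each device's share of the workload
is set proportional to its measured throughput. This is a minimal illustration, not the implemented
algorithm; the 2.4:1 device-speed ratio is a hypothetical stand-in for an unequal GPU pair:

```python
def rebalance(shares, times):
    """One balancing step: set each device's share proportional to its
    measured throughput (processed share / elapsed time)."""
    speeds = [s / t for s, t in zip(shares, times)]
    total = sum(speeds)
    return [v / total for v in speeds]

# Idealized run: two devices with a 2.4:1 speed ratio, starting at 50/50.
shares = [0.5, 0.5]
for _ in range(5):
    times = [shares[0] / 2.4, shares[1] / 1.0]   # time grows with share, shrinks with speed
    shares = rebalance(shares, times)
print([round(s, 3) for s in shares])  # [0.706, 0.294]
```

In this idealized model (constant device speeds, no measurement noise) one step already reaches
the fixed point; in practice the measured times fluctuate, which is why the shares in Figure 6.5 settle
gradually over several balancing intervals.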
Figure 6.6 presents the speed-ups obtained for the acceleration of the PMC Cycle kernels when a
second GPU is added to the system (platform mcx4) to balance the same Markov chain, considering
several QM grid sizes. As can be observed, multi-device performance scales better for simulations with greater
QM parts, which actually represents a common characteristic of real QM/MM systems. Considering
the high speed-up results obtained in the q3m_c/q3m_finish kernels, one might expect the execution
time of these kernels to rise slowly with the introduction of larger grids, justifying the slow scaling of the
dual-device speed-up curve presented in Figure 6.6. Nevertheless, it is important to recall that the
execution time of the PMC Cycle kernels (for each Markov chain) was already accelerated to the
microsecond order of magnitude for a single device, and at this level every overhead is noticeable. Therefore,
the observed dual-device speed-up, ranging from ∼1.5× to ∼1.6×, is deemed favorable for grids up
to 8 million points. This speed-up would continue to rise (never exceeding 2×) for larger and more
computationally intensive QM/MM systems.
Finally, it is worth noting that the represented configuration, where a single Markov Chain is run on
multiple accelerators, is particularly useful when #accelerators > #chains, which corresponds to a
typical situation in a many-node heterogeneous cluster. The scalability in a multiple Markov chain
scenario, for the case when #accelerators < #chains, will be discussed in Section 6.3.
6.3 Global PMC Results
To conclude the evaluation of the proposed parallel solution, the execution of the complete PMC
simulation is assessed (including the QM Update and the PMC Cycle stages). For this purpose, a
greater focus will be given to bench-R, corresponding to the longest and most realistic dataset. A
detailed discussion of the chemical aspects of the obtained results is presented in [15], which validates
the results against the work where this dataset was first described [11]. Figure 6.7 depicts the
simulation results, showing the conversion of the chorismate structure into prephenate.
Figure 6.7: QM/MM Simulation box for the bench-R dataset (partial representation), together with the
simulation results for the conversion of the chorismate structure into prephenate.

Table 6.5: bench-R execution time for the PMC Cycle (50k steps) and QM Update (24.8M iters) stages,
as well as for the full PMC simulation. The presented results consider two baselines and four parallel
solutions, with either a single or 8 Markov chains and fp64 or fp64-fp32 precision.

                                     Execution time (s)
Setup                              PMC Cycle  QM Update  Full Simulation
mcx4-baseline                         1883.1       96.4  980038.0 s = 272.23 h
avx2-baseline                          572.7       32.2  300028.0 s =  83.34 h
mcx4 fp64 1-chain                       42.9       96.4   69026.4 s =  19.17 h
mcx4 fp64 8-parallel-chains             42.9      113.4   10156.2 s =   2.82 h
mcx4 fp64-fp32 1-chain                  10.2       96.4   52833.4 s =  14.68 h
mcx4 fp64-fp32 8-parallel-chains        10.2      113.4    7757.7 s =   2.15 h

Table 6.5 presents the execution times for the inner PMC Cycle (comprising 50k steps), the QM Update
and the full PMC application (comprising 496 PMC outer iterations, which yields a total of 24.8M
PMC Cycle steps). This performance study was conducted with the mcx4 platform, since it has the
largest number of CPU cores (8), allowing up to 8 independent Markov Chains to be spawned while
scheduling their respective PMC Cycles on two GTX680 GPUs. It is worth noting the chemical
relevance of spawning multiple Markov chains, as it allows a better coverage of the chemical solution
space, thus improving the quality of the results. The execution times were measured for a single and for 8 Markov
chains, as well as for two reference versions: the avx2-baseline and one single core of mcx4 (henceforth
referred to as mcx4-baseline). Both reference timings are presented to avoid mischaracterizing the
attained parallelization quality. In fact, although the reference with the best performance is the
avx2-baseline, directly comparing the parallel solution on mcx4 with an AVX2-enabled core of the Intel
i7-4770K CPU would be unfair, because the latter runs the QM Update roughly 3× faster (32.2 s versus
96.4 s), diluting the performance gains in the OpenCL-accelerated part (the PMC Cycle). Nevertheless,
considerable speed-up gains are achieved even when comparing to the faster avx2-baseline reference.
Table 6.6 presents
the corresponding speed-up results versus both baselines (using the timings introduced in Table 6.5).
Considering all these run configurations, the execution time of the full bench-R simulation for the two
reference scenarios corresponds to 272.23 h (hours) and 83.34 h, respectively. The parallel solutions
reduce these execution times considerably, to values ranging from 19.17 h down to 2.15 h, depending
on the number of spawned Markov chains (either single or 8 chains) and the chosen numerical
precision. For the single-chain
case, the obtained speed-up is mainly due to the OpenCL acceleration of the PMC Cycle. As shown
Table 6.6: Performance speed-ups for bench-R, considering the execution times presented in Table 6.5.
                                   Speed-up versus mcx4-baseline
Setup                              PMC Cycle  Full Simulation
mcx4 fp64 1-chain                     43.80×          14.20×
mcx4 fp64 8-parallel-chains           43.80×          96.50×
mcx4 fp64-fp32 1-chain               184.23×          18.55×
mcx4 fp64-fp32 8-parallel-chains     184.23×         126.33×

                                   Speed-up versus avx2-baseline
Setup                              PMC Cycle  Full Simulation
mcx4 fp64 1-chain                     13.32×           4.35×
mcx4 fp64 8-parallel-chains           13.32×          29.55×
mcx4 fp64-fp32 1-chain                56.02×           5.67×
mcx4 fp64-fp32 8-parallel-chains      56.02×          38.68×
in Table 6.6, a speed-up of up to 184.23× is obtained in the PMC Cycle alone. However, this speed-up
is bounded by Amdahl's law, due to the QM Update fraction running on the CPU. In fact, by looking at
the mcx4-baseline reference scenario, one can observe that in the original run the PMC Cycle
represented 1883.1/(1883.1 + 96.4) = 95.13% of each PMC iteration (PMC Cycle + QM Update).
Hence, the speed-up of 18.55× presented in Table 6.6 was expected, since the speed-up of 184.23×
obtained in the PMC Cycle (fp64-fp32 version) would at most yield 1/(0.9513/184.23 + 0.0487) ≈ 18.57×
global speed-up. Therefore,
one can observe that the single-chain runs are limited by the QM Update fraction (4.87% in the mcx4-
baseline scenario), which uses the MOLPRO closed source program package, a necessary tool in the
current approach to the involved QM chemical calculations [14]. To tackle this limitation, the multiple
Markov chain approach was devised in this dissertation.
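The Amdahl bound invoked above can be checked numerically (a minimal sketch; the input figures are
those reported in Tables 6.5 and 6.6):

```python
def amdahl(parallel_fraction, speedup):
    """Overall speed-up when only a fraction of the runtime is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (parallel_fraction / speedup + serial_fraction)

# PMC Cycle share of a PMC iteration in the mcx4-baseline run (Table 6.5),
# combined with the 184.23x fp64-fp32 PMC Cycle speed-up (Table 6.6).
f = 1883.1 / (1883.1 + 96.4)   # ~0.9513
print(round(amdahl(f, 184.23), 2))  # 18.57
```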
For the multiple Markov chain case, the attained speed-up is mainly due to parallel MC state-space
exploration. In fact, by comparing the single with multiple chain speed-up values for the same precision
approach, a scalable speed-up trend can be observed from the obtained results. For example, by
comparing the speed-ups attained in the mcx4 fp64-fp32 for the cases of 1 and 8 chains, a speed-up
ratio of 126.33/18.55 = 6.81× is obtained. It is important to recall that the speed-up attainable by adding more
chains is limited by Equation 4.1, by the Host-side thread management, and by the overhead introduced
by concurrent memory and disk accesses issued by the CPU cores running the QM Updates in parallel.
In this case, one can verify from Table 6.5 that the mean QM Update execution time has degraded from
96.4 s to 113.4 s. Furthermore, although Equation 4.1 would yield a theoretical maximum of 23 chains
for this particular case, the 8 cores available in mcx4 limit the maximum number of chains one can run
on that platform to 8. Hence, the considered multiple Markov Chain solution achieves an efficiency of
38.68/(5.67 × 8 cores) ≈ 85%. Nevertheless, one can conclude that by using the
same GPUs for the PMC Cycle acceleration, the proposed implementation would scale well to integrate
a system with up to 23 CPU cores. Increasing the number of OpenCL accelerators would increase this
limit even further.
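The quoted efficiency follows directly from the tabulated speed-ups and can be reproduced in one line:

```python
def chain_efficiency(multi_chain_speedup, single_chain_speedup, n_cores):
    """Parallel efficiency of running one chain per core versus a single chain."""
    return multi_chain_speedup / (single_chain_speedup * n_cores)

# 8 chains on 8 cores, fp64-fp32 precision, versus the avx2-baseline (Table 6.6).
print(round(chain_efficiency(38.68, 5.67, 8), 3))  # 0.853
```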
Among the presented results, the most conservative speed-up of this parallel implementation is
assumed to be 38.68× (mcx4 fp64-fp32 8-chain versus avx2-baseline; see Table 6.6), as the
avx2-baseline corresponds to the reference with the best performance. Naturally, the 126.33×
speed-up obtained when comparing mcx4 to itself could remain close to this value if a better Intel Xeon
CPU had been used in
both the reference and the parallel solutions.
6.4 Numerical Evaluation: Convergence Accuracy and Energy Consumption
While the proposed parallel implementation does not make any approximation or relaxation with
respect to the original sequential method, yielding exactly the same output as the original PMC
implementation, it is important to consider different numerical precisions, and to evaluate how they
impact execution performance, energy consumption and the quality of the results. Accordingly, besides
the original 64-bit floating-point representation (fp64), the presented OpenCL version offers the
following numerical representation alternatives: i) mixed 64-bit and 32-bit floating-point (fp64-fp32), or
ii) mixed 64-bit/32-bit floating-point and 32-bit fixed-point (fp32-i32). In the fp64-fp32 version, the
computationally more complex
q3m_finish/mm_finish kernels use 32-bit floating-point precision for the ∆E^{C,grid}_{QM/MM} energy
computations, whereas 64-bit floating-point is employed for the remaining energy terms, whose
computations are much faster. This configuration also assumes the same data-type to store the grid,
as well as a copy of the lattice and the chmol. Likewise, the fp32-i32 version uses 32-bit floating-point
representations for the same energy computation, but it uses 32-bit fixed-point for the squared
distances. The latter operates on normalized grid and atom coordinates, represented by 32-bit integers,
which actually provides a higher precision than the alternative 32-bit floating-point. The usage of mixed
precision for different energy terms calls for casting operations, which may degrade the performance in
a GPU accelerator. To circumvent this degradation, all the necessary casting operations were moved to
the monte_carlo and decide kernels, concentrating the necessary conversions in the single-threaded
procedures of these kernels and avoiding redundant casting in the many-thread kernels.
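The precision advantage of 32-bit fixed point over 32-bit floating point for squared distances can be
illustrated with a small emulation. The scaling convention below (normalized coordinates mapped onto
the full signed 32-bit range, products accumulated in a wider integer) is an assumption for illustration,
not the exact kernel code:

```python
SCALE = 2**31 - 1  # map normalized coordinates in [0, 1) onto the signed 32-bit range

def to_fixed(x):
    # Quantize a normalized coordinate to 32-bit fixed point
    # (31 significant bits, versus the 24-bit mantissa of fp32).
    return int(round(x * SCALE))

def sq_dist_fixed(a, b):
    """Squared distance computed on fixed-point coordinates.

    Coordinate differences fit in 32 bits; their products are accumulated
    in a wider (64-bit) integer, as the GPU kernel would have to do."""
    acc = 0
    for ai, bi in zip(a, b):
        d = to_fixed(ai) - to_fixed(bi)
        acc += d * d
    return acc / (SCALE * SCALE)  # rescale back to normalized units

p = (0.125, 0.250, 0.500)
q = (0.100, 0.300, 0.450)
exact = sum((x - y) ** 2 for x, y in zip(p, q))   # 0.005625
print(abs(sq_dist_fixed(p, q) - exact) < 1e-9)    # True
```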
The resulting performance gains, for each of the considered precisions when executing the q3m_c
kernel, are presented in Table 6.7. Depending on the adopted GPU device, execution speed-ups as
high as 8.89× can be attained by simply adopting lower precisions, with minor degradations of the
obtained energy results. Moreover, the generated system configurations remain the same as long as
the accumulated error does not cause the sequence of selected systems to diverge, which was verified
to be the case for all the considered benchmarks. In Table 6.8, the error introduced in the
∆E^{C,grid}_{QM/MM} term and in the total system energy (E) is presented for each kernel version,
with respect to the fp64 implementation (which is numerically equivalent to the original serial version).
It can be observed that the fp32-i32 version offers higher precision than the fp64-fp32 one, due to the
greater number of significant bits used for the squared distance operations. In these simulations, a
maximum admissible error of em = 1.0 × 10^-1 kJ/mol was enforced, as commonly considered in this
research domain.
Table 6.7: Speed-up of the mixed precision q3m_c kernel versions versus the original fp64 version,
running on the same machine, for the case of bench-A.

             q3m_c speed-up (vs fp64)
version     GTX680  GTX780Ti   K20C
fp64-fp32    8.56×     7.39×  2.65×
fp32-i32     8.89×     7.44×  2.74×

Table 6.8: Obtained numerical precision. The error is shown for the ∆E^C_QM/MM energy term, as well
as for the total energy of the system (E), when considering the em = 1.0 × 10^-1 kJ/mol maximum error.
The average values were taken from the complete set of generated QM/MM systems, by using bench-A.

                          Error vs fp64
                    ∆E^C_QM/MM (kJ/mol)          E (kJ/mol)
version    measurement   value      % of em   value      % of em
fp64-fp32  mean          6.4×10^-5  0.064     4.2×10^-3  4.2
           max           2.9×10^-3  2.9       1.6×10^-2  16
fp32-i32   mean          1.6×10^-5  0.016     9.0×10^-4  0.9
           max           1.1×10^-3  1.1       1.6×10^-2  16

In order to further assess the impact of the considered mixed precision solutions, the average energy
consumption was measured on the Nvidia K20C GPU, by using the NVML library. The method
introduced in [8] was used to gather the attained power measurements, at the maximum allowed
sampling frequency of 66.7 Hz. Since this frequency is too low to sample one kernel launch of q3m_c
(which executes in the order of hundreds of microseconds), a testbench with just the q3m_c kernel was
built and launched repeatedly for 100k steps.
The obtained results are presented in Table 6.9. The first aspect worth noting refers to the
configuration that presented the highest average power: fp64-fp32. This fact can be justified by the
higher core occupancy allowed by the single-precision floating-point implementation. The fp64 version
has a lower average power dissipation for the opposite reason, i.e., its lower GPU occupancy results in
a reduced dynamic power requirement. For the fp32-i32 configuration, a GPU occupancy similar to that
of fp64-fp32 is expected, although the integer functional units consume less power, resulting in an 8 W
decrease in average power. To complement and further justify these observations, power and energy
consumption were also measured on the avx2-baseline configuration, by using the SchedMon power
and energy measurement tool [50]. Although the avx2-baseline draws (on average) approximately 4
times less power than the most energy-efficient parallel configuration on the K20C GPU (fp32-i32), the
acceleration attained by the GPU in the execution time of the q3m_c kernel greatly compensates for
this, yielding a much lower overall energy consumption and saving up to 28.8× energy. Although the
same tests could not be performed on the GTX680 and GTX780Ti GPUs (these GPUs do not feature
internal power counters), rather similar energy savings can be predicted for the GTX780Ti accelerator,
since it shares the same Kepler core architecture (GK110) as the K20C.
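The reported savings follow from E = P × t; the 28.8× figure can be reproduced directly from the
Table 6.9 measurements:

```python
def energy_savings(p_base_w, t_base_us, p_acc_w, t_acc_us):
    """Energy ratio between a baseline and an accelerated run (E = P × t)."""
    return (p_base_w * t_base_us) / (p_acc_w * t_acc_us)

# avx2-baseline (34 W, 76077 us per q3m_c) versus K20C fp32-i32 (139 W, 647 us).
print(round(energy_savings(34, 76077, 139, 647), 1))  # 28.8
```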
6.5 Summary
In this Chapter, a detailed performance assessment of the devised parallel heterogeneous approach
to the PMC QM/MM method was presented and discussed. The main performance metric of interest
was the execution time speed-up, when comparing the parallel solutions to either the avx2-baseline
(a single core of the i7-4770K processor, with AVX2 instructions enabled) or the mcx4-baseline
(a single core of Xeon E5-2609). To accomplish these profiling measurements, several tools were
employed, namely the PAPI [35] library, OpenCL Profiling Events, and the kcachegrind tool (based on
valgrind [37]).

Table 6.9: Execution time speed-up, energy savings and average power consumption, when comparing
the Tesla K20C GPU running all the devised numerical precision approaches with the avx2-baseline
(with the original fp64 precision). The testbench was run on the K20C GPU for 100k steps, in order to
ensure a representative sampling of the computational cost of q3m_c. The default core frequency
configuration was used for all experiments.

Setup                q3m_c time (µs)  q3m_c speed-up  Energy savings  Avg. Power
avx2-baseline fp64             76077              1×              1×        34 W
K20C fp64                       1775           42.9×           10.4×       140 W
K20C fp64-fp32                   670          113.5×           26.3×       147 W
K20C fp32-i32                    647          117.6×           28.8×       139 W

The performance of the parallel solution was assessed by using four sets of chemical
datasets, carefully designed by chemical experts from the Institut für Physikalische Chemie,
Georg-August-Universität Göttingen. In particular, a chorismate reaction dataset, relevant to the field of
application [11, 15], was benchmarked. The experiments for this particular dataset yielded a 56×
execution time speed-up in the simulation bottleneck (PMC Cycle), and a 38× speed-up for the full
simulation (when compared to the avx2-baseline). This is a significant acceleration, since it reduced the
full execution time from ∼80 hours to ∼2 hours. Furthermore, a scalability of 85% was observed for the
case of 8 Markov chains executing on a platform with 8 CPU cores and 2 GPUs. Finally, the numerical
quality and energy consumption of the proposed solution were evaluated (by using the SchedMon
power and energy measurement tool [50]) for alternative numerical representation schemes. Energy
savings of up to 28× were observed in the heaviest kernel of the simulation bottleneck.
CHAPTER 7
CONCLUSIONS
The objective of this MSc thesis was to accelerate the PMC QM/MM algorithm by designing an
efficient and scalable parallel implementation for heterogeneous architectures comprising a multi-core
CPU and one or more accelerators (e.g., GPUs). In particular, the performance of the devised solution
was to be evaluated in several system configurations, by studying molecular simulations relevant to
the Theoretical Chemistry field of application. The major metric of interest was the obtained speed-up
with respect to the original serial version, although the consumed energy and the resulting numerical
precision were also a target of analysis and discussion. OpenCL was chosen as the parallel framework,
to allow targeting heterogeneous architectures.
Before parallelizing the target application, an optimized single-core version was developed, in order
to establish a fair performance baseline. In this respect, optimized data structures and other preliminary
optimization schemes were employed. Two main procedures were identified: the QM Update and the
PMC Cycle. Then, a careful study of the available parallelization opportunities was made, eventually
leading to a multi-layered parallel solution, extracting parallelism by: i) running several independent
QM Updates, each corresponding to a Markov chain (chain-level parallelism); ii) executing the PMC
Cycle procedures in parallel with respect to each other (task-level parallelism); iii) executing the inner
iterations of each procedure in parallel, over different sections of the dataset (data-level parallelism).
Considering this approach, the compute devices in the target heterogeneous node architecture were
tasked with different parts of the problem, scheduling the intrinsically serial QM Update processes to
the CPU cores (one instance per core), and the highly task- and data-parallel PMC Cycle to the
available OpenCL accelerators. Concurrently with the accelerators, the host CPU ensures dynamic
load balancing, by distributing the workload of the heaviest kernels among multiple accelerators.
Subsequently, a detailed performance assessment of the devised parallel heterogeneous approach to
the PMC QM/MM method was presented and discussed. By exploiting the massively parallel GPU
architecture, the computational bottleneck of the original single-core approach was accelerated by
56×, for the case of a well-known chorismate dataset. To further promote the scalability of the proposed
implementation, the MC state-space was further sampled using several independent Markov Chains,
which was shown to scale with an efficiency of 85%. In a cumulative perspective, the complete PMC
simulation yielded a speed-up of 38×, effectively reducing the full execution time of the chorismate
QM/MM simulation from ∼80 hours to ∼2 hours, and achieving considerable savings in terms of time
and energy. Other chemical benchmarks were also evaluated, to assess the particular performance of
the PMC Cycle. For the case of a typical arginine dataset, a speed-up of up to 152× was achieved
when running the PMC Cycle on two heterogeneous Nvidia GPUs, and up to 111× when using a
heterogeneous system composed of an AMD GPU and an Nvidia GPU.
In conclusion, the objectives proposed for this dissertation were met. The cumulative contributions
of this thesis to the scientific community have resulted in two research articles. One has already been
submitted for publication in an international peer-reviewed journal [33], whereas the other is awaiting
submission [15]. In addition, the resulting application is now being actively used by the Free Floater
Research Group - Computational Chemistry and Biochemistry, Institut für Physikalische Chemie,
Georg-August-Universität Göttingen, for further scientific studies. The resulting parallel program
package will be released under the BSD-3-Clause open source licence.
7.1 Future Work
The present dissertation was naturally limited by the time available to develop an MSc thesis.
Therefore, a few optimizations and additional parallel schemes were not considered, either due to lack
of time or to deviation from the thesis scope. In this respect, the following future work is proposed:
(i) In order to exploit multiple nodes in a computing network, an MPI [26] solution could be devised,
to allow running a larger number of Markov chains. Since fairly good scalability was attained in
a multi-core CPU environment, and since almost no communication is required between Markov
chains, this approach could achieve good results.
(ii) An approach to simultaneously exploit the FPGA architecture and the CPU and GPU architectures
could be designed. Since OpenCL is now supported by Altera FPGAs, this would pose an interest-
ing scenario. This approach was not considered a priority in this dissertation, since one of the main
objectives was to optimize the developed solution to efficiently run in accelerators commonly found
in computational chemistry research groups (e.g. CPUs and GPUs).
(iii) The energy measurement approach herein considered could be integrated in the load balancing
algorithm, in order to achieve an energy-aware balancing solution. This approach was discarded
for the case of this dissertation, since among the selected hardware platforms, the only GPU able
to perform energy measurements is the Nvidia Tesla K20C.
(iv) Study of a possible grid partitioning approach, such as the neighbor-list or the cell-list schemes
discussed in [4, 49, 56]. However, since in the PMC QM/MM algorithm the QM grid has to be
updated every PMC (outer) iteration (by the QM Update), it is not obvious whether a cell-list or
neighbor-list approach would bring additional performance gains, as the overhead of creating
these lists would be incurred every PMC iteration.
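The chain distribution of item (i) is embarrassingly parallel; a hypothetical round-robin assignment of
chains to MPI ranks (the helper below is illustrative, not part of any existing code) shows how little
coordination it requires:

```python
def assign_chains(n_chains, n_ranks):
    """Round-robin assignment of independent Markov chains to MPI ranks.

    Since the chains exchange no data during the run, each rank only needs
    its own chain list; results are gathered once at the end."""
    return {rank: [c for c in range(n_chains) if c % n_ranks == rank]
            for rank in range(n_ranks)}

print(assign_chains(8, 3))  # {0: [0, 3, 6], 1: [1, 4, 7], 2: [2, 5]}
```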
REFERENCES
[1] Alejandro Acosta, Robert Corujo, Vicente Blanco, and Francisco Almeida. Dynamic load balancing
on heterogeneous multicore/multigpu systems. In High Performance Computing and Simulation
(HPCS), 2010 International Conference on, pages 467–476. IEEE, 2010.
[2] Amos G Anderson, William A Goddard III, and Peter Schroder. Quantum monte carlo on graphical
processing units. Computer Physics Communications, 177(3):298–306, 2007.
[3] Joshua A. Anderson, Eric Jankowski, Thomas L. Grubb, Michael Engel, and Sharon C. Glotzer.
Massively parallel monte carlo for many-particle simulations on GPUs. Journal of Computational
Physics, 254:27–38, December 2013.
[4] Joshua A Anderson, Chris D Lorenz, and Alex Travesset. General purpose molecular dynamics
simulations fully implemented on graphics processing units. Journal of Computational Physics,
227(10):5342–5359, 2008.
[5] Cedric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-Andre Wacrenier. StarPU: a
unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and
Computation: Practice and Experience, 23(2):187–198, 2011.
[6] Bernd A Berg and A Billoire. Markov chain monte carlo simulations. World Scient., Singapore,
2004.
[7] Alecio Pedro Delazari Binotto, Carlos Eduardo Pereira, Arjan Kuijper, Andre Stork, and Dieter W
Fellner. An effective dynamic scheduling runtime and tuning system for heterogeneous multi and
many-core desktop platforms. In High Performance Computing and Communications (HPCC), 2011
IEEE 13th International Conference on, pages 78–85. IEEE, 2011.
[8] Martin Burtscher, Ivan Zecena, and Ziliang Zong. Measuring gpu power with the k20 built-in sensor.
In Proceedings of Workshop on General Purpose Processing Using GPUs, page 28. ACM, 2014.
[9] Ricolindo L Carino and Ioana Banicescu. Dynamic load balancing with adaptive factoring methods
in scientific applications. The Journal of Supercomputing, 44(1):41–63, 2008.
[10] Long Chen, Oreste Villa, Sriram Krishnamoorthy, and Guang R Gao. Dynamic load balanc-
ing on single-and multi-gpu systems. In Parallel & Distributed Processing (IPDPS), 2010 IEEE
International Symposium on, pages 1–12. IEEE, 2010.
[11] Frederik Claeyssens, Kara E Ranaghan, Narin Lawan, Stephen J Macrae, Frederick R Manby,
Jeremy N Harvey, and Adrian J Mulholland. Analysis of chorismate mutase catalysis by qm/mm
modelling of enzyme-catalysed and uncatalysed reactions. Organic & biomolecular chemistry,
9(5):1578–1590, 2011.
[12] David Clarke, Alexey Lastovetsky, and Vladimir Rychkov. Dynamic load balancing of parallel com-
putational iterative routines on highly heterogeneous hpc platforms. Parallel Processing Letters,
21(02):195–217, 2011.
[13] Kenneth P Esler, Jeongnim Kim, David M Ceperley, and Luke Shulenburger. Accelerating quantum
monte carlo simulations of real materials on gpu clusters. Computing in Science & Engineering,
14(1):40–51, 2012.
[14] Jonas Feldt. Entwicklung einer störungstheoretischen QM/MM Monte Carlo Methode für die Studie
von Molekülen in Lösung. Master's thesis, Georg-August-Universität Göttingen, 2013.
[15] Jonas Feldt, Sebastiao Miranda, Joao C. A. Oliveira, Frederico Pratas, Nuno Roma, Pedro Tomas,
and Ricardo A. Mata. Perturbative monte carlo mixed quantum mechanics/molecular mechanics.
Journal of Chemical Information and Modeling (to be submitted).
[16] Christopher J. Fennell and J. Daniel Gezelter. Is the ewald summation still necessary? pairwise al-
ternatives to the accepted standard for long-range electrostatics. The Journal of Chemical Physics,
124(23):234104, 2006.
[17] Mark S. Friedrichs, Peter Eastman, Vishal Vaidyanathan, Mike Houston, Scott Legrand, Adam L.
Beberg, Daniel L. Ensign, Christopher M. Bruns, and Vijay S. Pande. Accelerating molec-
ular dynamic simulation on graphics processing units. Journal of Computational Chemistry,
30(6):864–872, 2009.
[18] George D Geromichalos. Importance of molecular computer modeling in anticancer drug develop-
ment. Journal of BUON: Official Journal of the Balkan Union of Oncology, 12:S101, 2007.
[19] Charles J Geyer. Practical markov chain monte carlo. Statistical Science, pages 473–483, 1992.
[20] Walter R Gilks, Sylvia Richardson, and David J Spiegelhalter. Introducing markov chain monte
carlo. In Markov chain Monte Carlo in practice, pages 1–19. Springer, 1996.
[21] Valentin Gogonea, Lance M Westerhoff, and Kenneth M Merz Jr. Quantum mechanical/quantum
mechanical methods. i. a divide and conquer strategy for solving the schrodinger equation for large
molecular systems using a composite density functional–semiempirical hamiltonian. The Journal
of Chemical Physics, 113(14):5604–5613, 2000.
[22] Khronos OpenCL Working Group. The OpenCL Specification version 1.2 revision 19, 2012.
[23] Clifford Hall, Weixiao Ji, and Estela Blaisten-Barojas. The metropolis monte carlo method with
CUDA enabled graphic processing units. Journal of Computational Physics, 258:871–879, Febru-
ary 2014.
[24] Intel. Intel SDK for OpenCL* Applications 2013 R2 Optimization Guide, pages 14-15, 2013.
[25] William L. Jorgensen, David S. Maxwell, and Julian Tirado-Rives. Development and testing of the
OPLS all-atom force field on conformational energetics and properties of organic liquids. Journal
of the American Chemical Society, 118(45):11225–11236, January 1996.
[26] Mario Lauria and Andrew Chien. Mpi-fm: High performance mpi on workstation clusters. Journal
of Parallel and Distributed Computing, 40(1):4–18, 1997.
[27] V.I. Lebedev. Values of the nodes and weights of ninth to seventeenth order gauss-markov
quadrature formulae invariant under the octahedron group with inversion. USSR Computational
Mathematics and Mathematical Physics, 15(1):44–51, January 1975.
[28] Arnaud Legrand, Helene Renard, Yves Robert, and Frederic Vivien. Mapping and load-balancing
iterative computations. Parallel and Distributed Systems, IEEE Transactions on, 15(6):546–558,
2004.
[29] Cong Liu, Jian Li, Wei Huang, Juan Rubio, Evan Speight, and Xiaozhu Lin. Power-efficient time-
sensitive mapping in heterogeneous systems. In Proceedings of the 21st international conference
on Parallel architectures and compilation techniques, pages 23–32. ACM, 2012.
[30] Y Lutsyshyn. Fast quantum monte carlo on a gpu. arXiv preprint arXiv:1312.1282, 2013.
[31] Tadaaki Mashimo, Yoshifumi Fukunishi, Narutoshi Kamiya, Yu Takano, Ikuo Fukuda, and Haruki
Nakamura. Molecular dynamics simulations accelerated by GPU for biological macromolecules with
a non-ewald scheme for electrostatic interactions. Journal of Chemical Theory and Computation,
9(12):5599–5609, December 2013.
[32] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward
Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics,
21:1087, 1953.
[33] Sebastiao Miranda, Jonas Feldt, Frederico Pratas, Ricardo Mata, Nuno Roma, and Pedro Tomas. A
parallel heterogeneous approach to perturbative monte carlo qm/mm simulations. Journal of High
Performance Computing Applications (submitted).
[34] Lubos Mitas. Diffusion monte carlo. Quantum Monte Carlo Methods in Physics and Chemistry,
525:247, 1998.
[35] Philip J Mucci, Shirley Browne, Christine Deane, and George Ho. Papi: A portable interface to
hardware performance counters. In Proceedings of the Department of Defense HPCMP Users
Group Conference, pages 7–10, 1999.
[36] Michael E. Mura and Peter J. Knowles. Improved radial grids for quadrature in molecular density-
functional calculations. The Journal of Chemical Physics, 104(24):9848–9858, June 1996.
[37] Nicholas Nethercote and Julian Seward. Valgrind: A program supervision framework. Electronic
notes in theoretical computer science, 89(2):44–66, 2003.
[38] Matías A Nitsche, Manuel Ferreria, Esteban E Mocskos, and Mariano C Gonzalez Lebrero. GPU accelerated implementation of density functional theory for hybrid QM/MM simulations. Journal of Chemical Theory and Computation, 10(3):959–967, 2014.
[39] NVIDIA. Kepler GK110, version 1.0, 2012.
[40] NVIDIA. CUDA C Programming Guide, version v5.5, 2013.
[41] NVIDIA. CUDA GPU Occupancy Calculator. CUDA SDK, 2010.
[42] Robert E Overman, Jan F Prins, Laura A Miller, and Michael L Minion. Dynamic load balanc-
ing of the adaptive fast multipole method in heterogeneous systems. In Parallel and Distributed
Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, pages
1126–1135. IEEE, 2013.
[43] John P. Perdew, Kieron Burke, and Matthias Ernzerhof. Generalized gradient approximation made
simple. Physical Review Letters, 77(18):3865–3868, October 1996.
[44] Frederico Pratas, Leonel Sousa, Johannes M. Dieterich, and Ricardo A. Mata. Computation of in-
duced dipoles in molecular mechanics simulations using graphics processors. Journal of Chemical
Information and Modeling, 52(5):1159–1166, May 2012.
[45] Giuseppe Scarpa, Raffaele Gaetano, Michal Haindl, and Josiane Zerubia. Hierarchical multiple Markov chain model for unsupervised texture segmentation. IEEE Transactions on Image Processing, 18(8):1830–1843, 2009.
[46] Desh Singh, Tom Czajkowski, and Andrew Ling. Tutorial: Harnessing the Power of FPGAs using Altera's OpenCL Compiler. Altera Corporation, 2013.
[47] Kyle Spafford, Jeremy Meredith, and Jeffrey Vetter. Maestro: Data orchestration and tuning for OpenCL devices. In Euro-Par 2010: Parallel Processing, pages 275–286. Springer, 2010.
[48] John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy, Leonardo G. Trabuco, and
Klaus Schulten. Accelerating molecular modeling applications with graphics processors. Journal of
Computational Chemistry, 28(16):2618–2640, 2007.
[49] Alfeus Sunarso, Tomohiro Tsuji, and Shigeomi Chono. GPU-accelerated molecular dynamics sim-
ulation for study of liquid crystalline flows. Journal of Computational Physics, 229(15):5486–5497,
2010.
[50] Luis Tanica, Aleksandar Ilic, Pedro Tomas, and Leonel Sousa. SchedMon: A performance and energy monitoring tool for modern multi-cores. In 7th International Workshop on Multi-/Many-Core Computing Systems (MuCoCoS'2014), 2014.
[51] Thanh N. Truong and Eugene V. Stefanovich. Development of a perturbative approach for Monte Carlo simulations using a hybrid ab initio QM/MM method. Chemical Physics Letters, 256(3):348–352, June 1996.
[52] Yutaka Uejima, Tomoharu Terashima, and Ryo Maezono. Acceleration of a QM/MM-QMC simulation using GPU. Journal of Computational Chemistry, 32(10):2264–2272, 2011.
[53] Bart Verleye, Pierre Henri, Roel Wuyts, Giovanni Lapenta, and Karl Meerbergen. Implementation of a 2D electrostatic particle-in-cell algorithm in Unified Parallel C with dynamic load-balancing. Computers & Fluids, 80:10–16, 2013.
[54] Florian Weigend and Reinhold Ahlrichs. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy. Physical Chemistry Chemical Physics, 7(18):3297–3305, August 2005.
[55] H.-J. Werner, P. J. Knowles, G. Knizia, F. R. Manby, M. Schutz, et al. MOLPRO, version 2012.1, a package of ab initio programs, 2012. See www.molpro.net.
[56] Zhenhua Yao, Jian-Sheng Wang, Gui-Rong Liu, and Min Cheng. Improved neighbor list algorithm
in molecular simulations using cell decomposition and data sorting method. Computer Physics
Communications, 161(1):27–35, 2004.
[57] Weihang Zhu and Yaohang Li. GPU-accelerated differential evolutionary Markov chain Monte Carlo method for multi-objective optimization over continuous space. In Proceedings of the 2nd Workshop on Bio-inspired Algorithms for Distributed Systems, pages 1–8. ACM, 2010.