A Parallel Heterogeneous Approach to Perturbative Monte Carlo QM/MM Simulations
Sebastião Salvador de Miranda
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Dr. Pedro Filipe Zeferino Tomás,
Dr. Nuno Filipe Valentim Roma
Examination Committee
Chairperson: Dr. Nuno Cavaco Gomes Horta
Supervisor: Dr. Pedro Filipe Zeferino Tomás
Members of the Committee: Dr. Gabriel Falcão Paiva Fernandes
October 2014
Acknowledgments
Foremost, I would like to thank my supervisors, Doctor Pedro Tomás, Doctor Nuno Roma and Doctor Frederico Pratas, who have provided me with invaluable guidance. I would also like to thank Doctor Gabriel Falcão, who reviewed the intermediate report of this dissertation and provided several insightful comments.
I would like to express my gratitude to Doctor Ricardo Mata, who enlightened me on several occasions about computational chemistry aspects, and invited me to spend a very pleasant month of research at the Free Floater Research Group, Institut für Physikalische Chemie, Georg-August-Universität Göttingen, Germany. Furthermore, I would like to thank Jonas Feldt, who helped me achieve a greater understanding of the PMC QM/MM simulation method, and with whom I have intensively collaborated in writing research articles and developing new simulation features. I would also like to thank my colleagues Tomás Ferreirinha, David Nogueira, Francisco Gaspar, Andriy Gorobets and João Silva, with whom I have discussed a multitude of topics, doubts and ideas during the development of my dissertation. Furthermore, I would like to thank João Guerreiro and Luís Tanica for having helped me in the development of power and energy measurement techniques.
Special thanks to my girlfriend Mafalda Coelho, who has endured several months of listening to dry
technical details about my dissertation. I would also like to thank my father Pedro Miranda and my mother
Ana Salvador, for having discussed with me several topics on matters of biology, chemistry, physics and
computation.
Finally, I would like to express my gratitude to INESC-ID and the Institut für Physikalische Chemie for having given me access to their infrastructure, namely their high performance computing platforms. Furthermore, the work presented herein was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under projects Threads (ref. PTDC/EEA-ELC/117329/2010) and P2HCS (ref. PTDC/EEI-ELC/3152/2012).
ABSTRACT
Molecular simulations play an increasingly important role in computational chemistry, computational biology and computer-aided drug design. However, traditional single-core implementations hardly satisfy current needs, due to the prolonged runs that result from not exploiting the intrinsic data and task parallelism of some of these methods. To address this limitation, a new heterogeneous parallel solution for Monte Carlo (MC) molecular simulations is herein introduced, exploiting fine-grained parallelism in the inner structure of the bottleneck procedures and coarse-grained parallelism in the MC state-space sampling. Unlike typical high-performance parallelization approaches for pure Quantum Mechanics (QM) or Molecular Mechanics (MM) methods, the work herein presented focuses on accelerating a novel Perturbative Monte Carlo (PMC) mixed QM/MM application. The hybrid nature of the proposed parallel approach warrants an efficient use of heterogeneous systems, composed of single or multiple CPUs and heterogeneous accelerators (e.g., GPUs), by relying on the multi-platform OpenCL programming framework. To efficiently exploit the parallel architecture, load balancing schemes were employed to schedule the work among the available accelerators. A speed-up of 56× is achieved in the computational bottleneck for the case of a relevant chorismate dataset, when compared with an optimized single-core implementation. A speed-up of 38× is observed for the full simulation, using both multi-core CPUs and GPUs, thus effectively reducing the execution time of the full simulation from ∼80 hours to ∼2 hours.
Keywords
Quantum Mechanics (QM), Molecular Mechanics (MM), Monte Carlo (MC) Simulations, Parallel
Computing, Heterogeneous Computing, OpenCL.
RESUMO
As simulações moleculares desempenham um papel cada vez mais importante na química e biologia computacionais e no desenvolvimento de fármacos assistido por computador. No entanto, as implementações tradicionais single-core têm execuções muito prolongadas, não aproveitando o paralelismo de dados e de tarefas intrinsecamente presente nalguns destes métodos. De forma a colmatar esta limitação, este trabalho introduz uma solução paralela e heterogénea para simulações moleculares Monte Carlo (MC), explorando o paralelismo fine-grained na estrutura interna do bottleneck computacional e o paralelismo coarse-grained na amostragem do espaço de estados de MC. Ao contrário de abordagens típicas de alta performance a algoritmos puros de Quantum Mechanics (QM) ou Molecular Mechanics (MM), este trabalho concentra-se na aceleração de um novo método Perturbative Monte Carlo (PMC) mixed QM/MM. A natureza híbrida da abordagem paralela proposta permite o uso de arquiteturas heterogéneas, compostas por um ou vários CPUs e aceleradores heterogéneos (e.g. GPUs), tirando partido da biblioteca multi-plataforma OpenCL. De forma a explorar eficazmente arquiteturas heterogéneas, foram aplicados esquemas de load balancing para distribuir a carga computacional pelos aceleradores disponíveis. É atingido um speed-up de 56× no bottleneck computacional para o caso de um chorismate dataset relevante na área, quando comparado com uma implementação single-core otimizada. No caso da simulação completa, é observado um speed-up de 38×, tirando partido de multi-core CPUs e GPUs. O tempo total desta simulação foi assim reduzido de ∼80 horas para ∼2 horas.
Palavras-Chave
Mecânica Quântica, Mecânica Molecular, Simulações Monte Carlo, Computação Paralela, Computação Heterogénea, OpenCL
CONTENTS
1 Introduction
  1.1 Objectives
  1.2 Main Contribution
  1.3 Document Outline
2 Heterogeneous Computing
  2.1 Multi-Core General-Purpose Processors (GPP) Architecture
  2.2 Graphical Processing Unit (GPU)
    2.2.1 AMD and Nvidia Architectures
  2.3 OpenCL
    2.3.1 Platform Model
    2.3.2 Execution Model
    2.3.3 Memory Model
    2.3.4 Programming Model
    2.3.5 OpenCL Runtime Parametrization
  2.4 Load Balancing Techniques
  2.5 Summary
3 Perturbative Monte Carlo QM/MM
  3.1 Algorithm Description
  3.2 Computational Complexity Analysis
  3.3 Data Dependencies
  3.4 Related Work
  3.5 Summary
4 Parallel Heterogeneous Solution
  4.1 Original PMC QM/MM
  4.2 Exploiting Markov Chain Parallelism
    4.2.1 Multiple Markov Chain Parallelism
  4.3 Parallelization Strategy
    4.3.1 OpenCL Host-Side Management
      4.3.1.A Load Balancing Among Multiple Markov Chains
  4.4 Data Structure Optimizations
    4.4.1 Indexing Molecules and Atoms
    4.4.2 Computing Distances
  4.5 Summary
5 Fine-Grained Parallelism and Multi-Device Load Balancing
  5.1 PMC Cycle Parallelization
    5.1.1 Monte Carlo
    5.1.2 Coulomb Grid QM/MM
    5.1.3 Coulomb/VDW MM
    5.1.4 Coulomb Nuclei/VDW QM/MM
    5.1.5 Decision Step
  5.2 Exploiting Single Markov Chain Parallelism
    5.2.1 Multiple OpenCL Devices
    5.2.2 Dynamic Load Balancing
      5.2.2.A Problem Partitioning Approaches
  5.3 Summary
6 Experimental Evaluation
  6.1 Benchmarking Setup
    6.1.1 Chemical Datasets
    6.1.2 Hardware Platforms
    6.1.3 Performance Baseline
  6.2 PMC Cycle Acceleration
    6.2.1 PMC Cycle Load Balancing
    6.2.2 PMC Cycle Scalability
  6.3 Global PMC Results
  6.4 Numerical Evaluation: Convergence Accuracy and Energy Consumption
  6.5 Summary
7 Conclusions
  7.1 Future Work
LIST OF FIGURES
2.1 Example CPU with 4 cores and 3 levels of cache.
2.2 Organization of the AMD Southern Islands GPU architecture.
2.3 GTX 680 device architecture.
2.4 OpenCL Platform Model [22].
2.5 Partitioning of work-items into work-groups.
2.6 Partitioning of work-items into work-groups.
2.7 Example of a heterogeneous network composed of several compute nodes, each comprised of multiple Central Processing Unit (CPU) cores and one or more specialized accelerators.
3.1 A system composed of one QM molecule (C) and two MM solvent molecules (A and B). For each MC step, the difference in energy between the molecule moved (A) and every other molecule has to be computed, but at different levels of theory.
3.2 Perturbative Monte Carlo QM/MM with focus on the simulation bottleneck (PMC Cycle, right). Arrows represent data dependencies.
3.3 Main data structures used in the PMC Cycle. Refer to Table 3.1 for parameter definitions.
3.4 Data dependencies within the PMC Cycle. The VDW QMMM and Coulomb Nuclei QMMM processes only read the atoms that are part of the QM molecule, not the whole lattice.
4.1 Independent MC state-space exploration chains (illustrative example for 2 chains), each generating an independent sampling of the conformal space of the target QM/MM system.
4.2 MC state-space alongside the execution timeline for three Markov chains.
4.3 Simultaneous exploitation of chain-level, task-level and data-level parallelism in the PMC QM/MM method.
4.4 Multi-process/multi-threading structure of the designed parallel solution for the PMC method (right), alongside the original dual-process approach (left).
4.5 Program flow of the devised parallel PMC program, for the case of a single-device single-process instance (in order to keep the illustration clear). The legend for the numbered parts of this figure is presented throughout the text.
4.6 mol2atom data structure, together with the lattice vectors. The mol2atom structure returns the index of the first atom belonging to the target molecule, which can then be used to index the lattice vectors, which contain the {x, y, z, σ, ε, q} data.
4.7 Original approach to distance computation (left), together with the devised on-the-fly solution (right). For the sake of clarity, the distance computation procedures were singled out, although they are executed in the same computation loop as the Coulomb/VDW procedures. The remaining procedures of the PMC Cycle step have been omitted, also for the sake of clarity.
5.1 Mapping of the PMC Cycle procedures into OpenCL kernels. It should be noticed that some procedures were merged into the same kernel. Furthermore, the OpenCL version requires additional kernels for the parallel reductions (mm_finish and q3m_finish, marked with a ∗).
5.2 Memory layout example for the main data structures used in the PMC Cycle.
5.3 Diagram of the devised monte_carlo kernel, together with the layout of the data manipulated in this procedure.
5.4 Scheme used for partitioning the grid among the work-groups, in order to allow a coalesced memory access pattern. For the sake of keeping the illustration clear, an example for P = 2 and wgsize = 4 is shown.
5.5 q3m_c and q3m_finish kernel structure. In this example, work-group 0 is presented with additional detail, although all work-groups share an identical structure. Likewise, the 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable. Furthermore, additional details concerning the first global memory accesses (label 1) are depicted in Figure 5.4.
5.6 q3m_vdwc kernel structure. An 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable.
5.7 decide_update kernel diagram. An 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable.
5.8 Exploiting multiple heterogeneous OpenCL devices to execute the PMC Cycle. The execution is balanced by executing different kernels on each device and dividing the work of the heavier kernels (q3m_c and q3m_reduce).
5.9 Work-flow of the centralized predicting-the-future dynamic load balancing solution employed in this dissertation.
6.1 Time footprint of a single PMC Cycle step for the bench-A dataset running on the avx2-baseline.
6.2 One complete PMC outer iteration, comprised of 10k PMC Cycle steps and a QM Update, for the bench-A dataset running on the avx2-baseline. The bottleneck of each PMC iteration is the PMC Cycle.
6.3 Speed-up obtained for a PMC Cycle with 10k iterations, when using fp64-fp32 mixed precision. The corresponding execution times are presented in Table 6.3.
6.4 OpenCL kernel timings (per step) for the PMC Cycle running on the mcx2 heterogeneous platform. The load is balanced for the heavier kernels (q3m_c/q3m_finish, corresponding to Coulomb QM/MM), whereas the lighter kernels were scheduled to the first Graphical Processing Unit (GPU). The considered benchmark is bench-A, using mixed fp64-fp32 precision.
6.5 Convergence pattern of the implemented load balancing algorithm (balancing every 2000 steps), for bench-C running on the GTX 780Ti/660Ti platform (mcx2). The presented PMC Cycle time measurements represent mean times since the previous balancing.
6.6 Scalability of the PMC Cycle when changing the size of the QM part in bench-A. Speed-up results are presented for a dual GTX 680 system with respect to a single GTX 680 (platform mcx4).
6.7 QM/MM simulation box for the bench-R dataset (partial representation), together with the simulation results for the conversion of the chorismate structure into prephenate.
LIST OF TABLES
3.1 QM/MM run characterization, together with the typical parameter range for the benchmarks considered in this work. For the case of homogeneous solvents, the Z(i)MM parameter (concerning molecule i) will be the same for every MM molecule.
5.1 Complexity of communication and synchronization overheads, with respect to the QM/MM system characteristics and to run parameters.
6.1 Considered QM/MM benchmark datasets. The chemical aspects of bench-R are presented in detail in [11].
6.2 Considered execution platforms in the experimental evaluation.
6.3 Execution time (in seconds) for a PMC Cycle with 10k steps, on several hardware platforms, when using fp64-fp32 mixed precision. The column "Total" corresponds to the complete execution times of the PMC Cycle (10k steps), including the final serial overhead of reading back and writing the output to a file. This overhead is discriminated in column "Output". The presented execution times correspond to a median among four experimental trials, for each platform configuration.
6.4 Kernel execution times obtained on the GTX 780Ti accelerator and on the reference avx2-baseline platform, for the particular case of bench-A. The speed-up with respect to the avx2-baseline is also presented, together with the fraction of the PMC Cycle (%) each kernel represents.
6.5 bench-R execution time for the PMC Cycle (50k steps) and QM Update (24.8M iters) stages, as well as for the full PMC simulation. The presented results consider two baselines and four parallel solutions, with either a single or 8 Markov chains and fp64 or fp64-fp32 precision.
6.6 Performance speed-ups for bench-R, considering the execution times presented in Table 6.5.
6.7 Speed-up of the mixed precision q3m_c kernel versions versus the original fp64 version, running on the same machine, for the case of bench-A.
6.8 Obtained numerical precision. The error is shown for the ∆E_QM/MM^C energy term, as well as for the total energy of the system (E), when considering the e_m = 1.0 × 10⁻¹ kJ/mol maximum error. The average values were taken from the complete set of generated QM/MM systems, by using bench-A.
6.9 Execution time speed-up, energy savings and average power consumption, when comparing the Tesla K20C GPU running all the devised numerical precision approaches with the avx2-baseline (with the original fp64 precision). The testbench was run on the K20C GPU for 100k steps, in order to ensure a representative sampling of the computational cost of q3m_c. The default core frequency configuration was used for all experiments.
LIST OF ACRONYMS
MD Molecular Dynamics
MC Monte Carlo
PMC Perturbative Monte Carlo
DMC Diffusion Monte Carlo
VMC Variational Monte Carlo
AFMM Adaptive Fast Multipole Method
MM Molecular Mechanics
QM Quantum Mechanics
QMC Quantum Monte Carlo
vdW Van der Waals
CPU Central Processing Unit
GPU Graphical Processing Unit
DSP Digital Signal Processor
FPGA Field Programmable Gate Array
ILP Instruction Level Parallelism
SIMD Single Instruction Multiple Data
MIMD Multiple Instruction Multiple Data
SPMD Single Program Multiple Data
PE Processing Element
CU Computing Unit
GPC Graphics Processing Cluster
PC Program Counter
LDS Local Data Share
SI Southern Islands
SMX Streaming Multiprocessor
CC Compute Capability
HPC High Performance Computing
CHAPTER 1
INTRODUCTION
Contents
1.1 Objectives
1.2 Main Contribution
1.3 Document Outline
Computer simulations have become standard tools in chemical research, allowing for the prediction
of complex molecular structures, together with a comprehensive characterization of their properties.
Using methods from theoretical chemistry, where mathematics and physics are used to study chemical processes, computational chemistry studies the properties of a chemical system, describing intermolecular interactions, geometrical arrangements and other chemically related problems. A particular case of molecular computer simulation is drug docking simulation, a method that predicts the preferred configurations of two molecules when binding to each other and plays a crucial role in the lengthy process of computer-aided drug design [18], to which thousands of lives are tied.
In molecular computer simulations, execution time and memory scale rapidly with the size of the
system being simulated, leading to prolonged runs (sometimes in the order of weeks or months) and
resulting in wasted time and energy. Advances in this field are not only due to recent developments of
physical models, but also to the advances in computing systems, which substantially reduce computa-
tional time. Mature implementations of computational science software are usually highly optimized for
traditional single core Central Processing Unit (CPU) architectures, and therefore are intrinsically limited
by advances in single core execution time. To tackle this limitation, more recent High Performance Com-
puting (HPC) solutions have been exploiting the advances in parallel and heterogeneous computing,
using parallel platforms such as multi-core CPUs and many-core Graphical Processing Units (GPUs)
and specialized accelerators such as Field Programmable Gate Arrays (FPGAs) and Digital Signal Pro-
cessors (DSPs).
Molecular simulations are commonly based on Molecular Dynamics (MD) or on the Monte Carlo (MC)
method. MD simulates the system by calculating the forces acting on each atom, applying classical
mechanics to compute the resulting velocities, which are subsequently used to evolve the system in time.
MD allows the study of a wide range of dynamical properties, such as the conformational landscape
of a molecule. However, usable results can only be obtained by using very small simulation steps
(in the order of the femtosecond), which limits the system simulation to the order of microseconds.
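The force-to-velocity-to-position update described above is typically carried out with a time-stepping integrator such as velocity Verlet. The sketch below is a generic illustration of that scheme for a one-dimensional particle (it is not the integrator of any code discussed in this thesis; the harmonic force and the step count are arbitrary choices for the example):

```python
import math

def velocity_verlet(x, v, force, dt, mass=1.0, steps=1000):
    """Integrate Newton's equations of motion with the velocity-Verlet
    scheme: positions and velocities are advanced in small steps dt
    (femtosecond-scale in real MD runs)."""
    a = force(x) / mass
    for _ in range(steps):
        x += v * dt + 0.5 * a * dt * dt   # advance position
        a_new = force(x) / mass           # force at the new position
        v += 0.5 * (a + a_new) * dt       # advance velocity
        a = a_new
    return x, v

# Harmonic oscillator F(x) = -x has period 2*pi, so integrating over
# exactly one period should bring the particle (approximately) back
# to its starting state.
x1, v1 = velocity_verlet(1.0, 0.0, lambda x: -x,
                         dt=2 * math.pi / 1000, steps=1000)
```

The example also hints at why MD time steps must be so small: the step dt has to resolve the fastest oscillation in the system, which is what limits total simulated time to the microsecond scale.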
Conversely, the Metropolis MC method [32] samples the system in the ensemble space, rather than
following a time coordinate. With this method, a sequence of random configurations is obtained on
the basis of Maxwell-Boltzmann statistics, by performing random movements at each frame and by
evaluating the corresponding change of the system energy. The resulting set is then analysed from
the perspective of the specific thermodynamic property under consideration. Even though MC does not
enable the computation of dynamical quantities, it allows studying processes with longer timescales, for
which sampling in time would be unfeasible.
Accordingly, the underlying method for calculating the energy of a given molecular structure can vary with the system and the properties under study. The choice may fall on traditional Molecular Mechanics (MM), Quantum Mechanics (QM) or mixed QM/MM methods. MM approaches represent atoms and molecules through ball-and-spring models, with heavily parameterized functions to describe their interactions. However, such an approach may lead to several limitations. For example, atomic bonds have to be kept throughout each simulation, thus preventing a chemical reaction from being modeled in a single run. Alternatively, QM approaches explicitly simulate the electrons, at the cost of a much higher computational burden, as they involve obtaining approximate solutions to the Schrödinger equation [21]. Furthermore, the computational cost of most QM methods scales exponentially with the system size, thus impeding the modelling of more complex structures. An alternative solution consists of a mixed QM/MM approach, which combines the strengths of each method. In this case, a small active region is simulated with QM, while the remaining environment is represented by classical MM. Nevertheless, the combination of the mixed QM/MM terms with the pure QM and MM terms that co-exist in this approach usually results in a very computationally diverse algorithm, containing both heavy single-threaded code and several opportunities to exploit task and data parallelism.
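The three families of terms mentioned above can be made explicit by writing down the additive coupling scheme commonly used in hybrid methods (this is the generic textbook form, not necessarily the exact expression used in the PMC method):

```latex
E_{\text{total}}
  = \underbrace{E_{\text{QM}}}_{\text{active region}}
  + \underbrace{E_{\text{MM}}}_{\text{environment}}
  + \underbrace{E_{\text{QM/MM}}}_{\text{coupling}}
```

The coupling term E_QM/MM, which gathers the electrostatic (Coulomb) and van der Waals interactions between the two regions, is precisely where the mixed terms arise, and each of the three terms exposes a different computational profile.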
Ongoing collaboration between INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, and the Institut für Physikalische Chemie, Georg-August-Universität Göttingen, led to this work, with the objective of accelerating their novel algorithm for Perturbative Monte Carlo (PMC) mixed QM/MM simulation of periodic systems. Generally, the purpose of this model is to explicitly describe a system composed of a solvent and a solute, by using Metropolis MC sampling and a mixed QM/MM method for the energy calculation [14]. An important application of such a strategy is the simulation of the docking of drugs in the active site of a protein using QM, taking the surrounding environment into account explicitly using MM. However, the original serial implementation of the PMC QM/MM suffers from extremely long execution times, thus severely limiting theoretical chemistry research on large (and realistic) QM/MM systems.
1.1 Objectives
The objective of this MSc thesis is to accelerate the PMC QM/MM algorithm by designing an efficient and scalable parallel implementation for heterogeneous architectures comprised of multi-core CPUs and GPUs. Furthermore, performance will be evaluated in several system configurations, by studying molecular simulations relevant to the Theoretical Chemistry field of application. The speed-up obtained with respect to the original serial version will be the main metric of interest, although the consumed energy and the resulting numerical precision will also be discussed. The fundamental objectives of this work are the following:
(i) Devise a parallel approach for the Perturbative Monte Carlo QM/MM simulation method.
(ii) Enable the efficient exploitation of heterogeneous hardware platforms.
(iii) Ensure a good scalability of the devised solution with the available computational resources.
(iv) Assess the performance of the developed solution.
1.2 Main Contribution
By addressing the objectives presented in Section 1.1, the main contributions of this work are the following:
(i) First parallel heterogeneous solution for the Perturbative Monte Carlo QM/MM method. The devised approach uses the OpenCL framework to parallelize the bottleneck procedures of the PMC algorithm, enabling computational chemistry researchers to use a wider variety of platforms (in comparison to using CUDA or other vendor-specific frameworks).
(ii) Acceleration procedure based on the simultaneous exploitation of fine-grained (at the data level), coarse-grained (at the Markov chain level) and task-level (pure QM, pure MM and QM/MM procedures) parallelism, to achieve a heterogeneous solution for platforms composed of multi-core CPUs and GPUs. Furthermore, a performance-aware dynamic load balancing algorithm was employed to fully exploit the computing power of all the devices in a given heterogeneous platform.
(iii) Parallel method for sampling the MC state-space, using a multiple Markov chain exploration scheme to effectively exploit coarse-grained parallelism in the available CPU cores. This solution proved to scale with an efficiency of about 85%. Furthermore, each group of CPU cores can share one or more GPU accelerators, which run the simulation bottleneck (PMC Cycle) with a speed-up ranging from about 13× to 152×.
(iv) Evaluation of energy saving and acceleration opportunities based on the adaptation of the numerical precision used by the algorithm, when considering double- or single-precision floating-point or fixed-point representations. This study was integrated in the analysis of performance, numerical quality and power. The devised mixed precision solutions offer up to 2.7× speed-up and save up to 2.8× energy in the bottleneck kernels, when compared to the double-precision version. With respect to the baseline PMC implementation, energy savings reach up to 28.8×.
(v) Assessment of the quality of the devised solution with several benchmarks relevant to the Theoretical Chemistry field of application. The designed parallel approach was tested on several different system configurations composed of Nvidia GPUs, AMD GPUs and Intel CPUs, producing the same chemical results as the original serial implementation, with numerical differences far below the maximum acceptable error. For the longest QM/MM simulation herein discussed, the parallel solution effectively reduced the full execution time of the PMC from ∼80 hours to ∼2 hours.
The cumulative contributions of this thesis to the scientific community have resulted in two research
articles. The first has already been submitted for publication in an international peer-reviewed journal,
whereas the second is awaiting submission:
• Sebastiao Miranda, Jonas Feldt, Frederico Pratas, Ricardo Mata, Nuno Roma, and Pedro Tomas, "Efficient Parallelization of Perturbative Monte Carlo QM/MM Simulations in Heterogeneous Platforms", International Journal of High Performance Computing Applications (submitted).
• Jonas Feldt, Sebastiao Miranda, Joao C. A. Oliveira, Frederico Pratas, Nuno Roma, Pedro Tomas, and Ricardo A. Mata, "Perturbative Monte Carlo mixed Quantum Mechanics/Molecular Mechanics", Journal of Chemical Information and Modeling (to be submitted).
In addition, the resulting application is now being actively used by the Free Floater Research Group -
Computational Chemistry and Biochemistry, Institut fur Physikalische Chemie, Georg-August-Universitat
Gottingen, for further scientific studies. The resulting parallel program package will be released under
the BSD-3-clause open source licence.
1.3 Document Outline
In Chapter 2, an overview of the current state-of-the-art CPU and GPU hardware is presented, as
well as a review of the literature on load balancing algorithms and a description of the OpenCL frame-
work. Chapter 3 presents a detailed description of the PMC QM/MM algorithm from a computational
point of view, and includes a discussion on the related work on accelerating computational chemistry
algorithms. In Chapter 4, a multi-device heterogeneous solution is introduced, and the strategy for ex-
ploiting multiple GPUs and CPU cores to execute multiple Markov chains is presented. In Chapter 5, the
developed OpenCL approach to the simulation bottleneck (PMC Cycle) is discussed in detail, as well as
a dynamic load balancing solution. In Chapter 6, the performance of the developed solution is evaluated
with a set of chemical benchmarks over a wide range of hardware configurations. Furthermore, an analysis of the scalability is performed for both single and multiple Markov chain solutions, and the impact of the numerical representation on the execution time, numerical quality and energy consumption is analyzed.
Finally, in Chapter 7, the conclusions of the presented work are drawn and the future work is discussed.
CHAPTER 2
HETEROGENEOUS COMPUTING ARCHITECTURES
Contents
2.1 Multi-Core General-Purpose Processors (GPP) Architecture
2.2 Graphical Processing Unit (GPU)
2.3 OpenCL
2.4 Load Balancing Techniques
2.5 Summary
Due to differing computational requirements, CPU and GPU architectures have been evolving in different directions, each with its own advantages and disadvantages. Notwithstanding, heterogeneous systems composed of both CPUs and GPUs can take advantage of both devices to accelerate the execution of a program. In particular, the GPU architecture provides many cores, each substantially simpler than a CPU core, trading off single-thread for multi-thread performance. The GPU achieves high throughput by hiding the memory access latency of one thread with arithmetic operations from other threads, and by rapidly switching execution context between groups of threads. Such context switching has very little cost in comparison to CPU threads because hundreds of thread contexts are stored on-chip1. Conversely, state-of-the-art multi-core CPU architectures offer a few but highly complex cores, using techniques to exploit Instruction Level Parallelism (ILP) and multiple levels of caches to accelerate main memory access. This higher complexity results in increased area and power consumption, which allows only a small set of cores to co-exist in a single die.
When targeting a CPU/GPU heterogeneous environment, the application must be carefully analyzed and partitioned to efficiently take advantage of both devices. Code with intensive flow control or limited data and functional parallelism should be kept on the CPU, whereas arithmetic-intensive, highly data-parallel code should be executed on the GPU. Furthermore, the applied partitioning should in general minimize communication between the CPU and the GPU. To design an efficient workload partitioning, a load balancing solution may be devised. The OpenCL framework does not offer any intrinsic tools for scheduling the workload between compute devices, and as such one must implement a balancing approach fit for the application at hand. In this respect, several authors have studied workload scheduling to complement standard heterogeneous frameworks, such as libraries for the CUDA framework [10], [1], [7] and the Maestro library for the OpenCL framework [47]. Other load balancing alternatives [9, 10, 12, 28, 42, 53] are discussed in Section 2.4.
In this chapter, an overview of both state-of-the-art CPU and GPU architectures is presented, followed by an introduction to the OpenCL programming framework and a review of the literature on load balancing solutions for heterogeneous platforms.
2.1 Multi-Core General-Purpose Processors (GPP) Architecture
State-of-the-art mainstream multi-core CPU architectures offer a few but highly complex CPU cores.
Very fast memory is available through the use of registers local to each core and access to the larger
but slower main memory is made via several levels of caches. In Figure 2.1, a typical example of a
multi-core CPU architecture with three layers of cache (2 private and 1 shared) is displayed. Several
hardware techniques are employed to accelerate single-threaded execution, such as increasing the clock
frequency through multi-stage hardware pipelining, resulting in higher instruction throughput. Furthermore, modern architectures exploit ILP using super-scalar and out-of-order instruction execution. While the former allows executing several instructions in parallel in the available functional units (provided there are no data, control or structural hazards), the latter enables reordering independent instructions to reduce
1For the case of the NVIDIA GK104/GK110 architectures, the maximum number of resident threads per Multi-processor is 2048 [39].
processor stalls. Program flow control overhead is mitigated by branch prediction hardware and ulti-
mately by allowing speculative execution. Latency caused by inevitable processor stalls may be further
hidden by hardware multi-threading, allowing simultaneous execution of different threads in the same
processor.
Figure 2.1: Example CPU with 4 cores and 3 levels of cache (private L1 and L2 caches per core, a shared L3 cache, and a common memory interface).
Higher hardware complexity results in increased area and power consumption per core, which is the reason why multi-core CPUs only include a small number of cores relative to GPUs. This means that the type of parallelism that can be extracted from a multi-core CPU architecture is also more coarse-grained, typically leading to the application of the Multiple Instruction Multiple Data (MIMD) parallel programming paradigm. Furthermore, communication between threads in different CPU cores is much more expensive than communication between threads of the same thread-block in a GPU. In modern CPUs, each core offers Single Instruction Multiple Data (SIMD) instructions that enable the extraction of fine-grained parallelism (e.g. Intel SSE/AVX instructions), making the multi-core CPU a very versatile parallel platform. Although it does not match the GPU in terms of floating-point operations per second for highly data-parallel applications, it is more efficient for algorithms with complex control flow or a very coarse-grained parallel structure.
2.2 Graphical Processing Unit (GPU)
Graphical Processing Units were originally designed to accelerate graphics computations. However, given the complexity of designing highly efficient dedicated architectures with support for a large number of operations, significant design changes have been made over time, and GPU vendors started to introduce programmable vertex and pixel shaders. Over the past years, programmability support has increased substantially, allowing for General Purpose computing on Graphics Processing Units (GPGPU). Meanwhile, to facilitate programmability, both Nvidia and AMD released proprietary GPU programming languages, respectively CUDA and CTM (although AMD has lately embraced the OpenCL open standard, which is also supported by Nvidia GPUs, Intel CPUs and embedded GPUs, and a multitude of other devices). To better understand the architectural differences between CPUs and GPUs, this section presents an overview of AMD's Southern Islands and Nvidia's Kepler architectures.
2.2.1 AMD and Nvidia Architectures
Figure 2.2 depicts the AMD Southern Islands (SI) GPU architecture (HD7000 family). This architec-
ture is composed of several Computing Units (CUs). Each CU has one scalar unit and 4 vector units
composed of an array of 16 processing elements (PEs) each. Local to each CU, there are also five banks
of vector and scalar General Purpose Registers (vGPR/sGPR) and Local Data Share (LDS) memory.
The instruction issue takes four cycles, during which the four 16-Processing Element (PE) arrays execute 64 work-items in total. The resulting 64-element vector is called a wavefront2. Processing elements within a compute unit execute in lock-step, whereas compute units execute independently with respect to each other. Lock-step execution may pose problems if work-items from the same wavefront fall on different branch paths, in which case all paths must be executed serially, thus reducing efficiency. This is because work-items from the same wavefront share the same Program Counter (PC). In this device architecture family, the four 16-PE arrays execute code from different wavefronts.
Figure 2.2: Organization of the AMD Southern Islands GPU architecture.
The current state-of-the-art NVIDIA device architecture is Kepler (Compute Capability (CC) 3.X). An example device from this family is the GeForce GTX 680 GPU, which includes the Kepler GK104 chip. This particular GPU is composed of 4 Graphics Processing Clusters (GPCs), each with 2 Streaming Multiprocessors (SMX), and 4 memory controllers. Each SMX is in turn composed of 192 CUDA cores (roughly equivalent to the AMD ALUs presented earlier). Each SMX contains 4 warp schedulers that dispatch two instructions per warp with active threads, every clock cycle. A warp is the group of threads representing the finest grain of instruction execution of a multiprocessor. For the Tesla, Fermi and Kepler device architectures, this number is 32, meaning that at least 32 threads must execute the same instruction. A warp is roughly the equivalent of a wavefront in AMD hardware.
A thread is said to be active if it is on the warp's current execution path; otherwise it is inactive. When threads in the same warp follow different execution paths, the warp is said to be diverging (the same happens for wavefronts in AMD hardware). For example, in the case of a kernel with
2Other AMD graphic card families may have different wavefront sizes.
Figure 2.3: GTX 680 (GK104) device architecture: 4 GPCs with 2 SMXs each, 4 memory controllers, a shared L2 cache and a GigaThread engine; each SMX contains 192 CUDA cores (INT/FP units), 4 warp schedulers with 8 dispatch units, SFU and LD/ST units, a 64 kB L1 cache / shared memory, a 65536 x 32-bit register file, and texture and uniform caches.
a branch where some threads fall on different branch sections, all sections must be executed serially, reducing parallel efficiency. To mitigate this problem, the programmer should try to align branch outcomes with warp/wavefront boundaries, ensuring that threads from the same warp/wavefront always fetch the same instructions. This may not be possible at all if the branching is not predictable, or it may change the algorithm's memory access patterns in such a way that the losses outweigh the gains obtained by reducing divergence. In older GPU devices (i.e. Nvidia's CC 1.X), optimal memory access patterns were restricted to a fairly reduced set, but in CC 3.X the access patterns that still offer optimal performance are much more relaxed [40].
2.3 OpenCL
As previously mentioned, several paradigms exist for programming both CPUs and GPUs. However,
unlike most alternatives, OpenCL is supported by several different platforms, such as GPUs from multiple vendors, multi-core CPUs, DSPs and FPGAs [46]. Also, OpenCL simplifies the orchestration of multiple devices in a heterogeneous environment and allows writing code that is portable across different architectures. Thus, OpenCL was chosen for performing the proposed work.
OpenCL is organized in a hierarchy of models [22]: Platform Model, Execution Model, Memory
Model and Programming Model. Each of these models is explained in the following sections. The
OpenCL framework includes the OpenCL compiler (OpenCL C), the OpenCL platform layer and OpenCL
Runtime. In this project, the newest available OpenCL standard was used for each device (OpenCL 1.1
for the considered Nvidia GPUs and OpenCL 1.2 for the Intel CPUs and AMD GPUs).
2.3.1 Platform Model
The Platform Model defines how a program maps onto the OpenCL platform, which is an abstract
hardware representation of the underlying device. As depicted in Figure 2.4, the platform model is
composed of a Host connected to one or multiple OpenCL devices. An OpenCL device is a collection of CUs, which are in turn divided into one or more PEs3, where the computation is done. The code that runs on the host uses the OpenCL Runtime to interface with the OpenCL device, to which it may enqueue synchronization commands, data or kernels. A kernel is a function written in OpenCL C, and can be compiled before or during program execution. Within each CU, PEs can execute either in SIMD or Single Program Multiple Data (SPMD) fashion. In the former, PEs execute in lock-step, whereas in the latter PEs keep their own program counter and may follow independent execution paths.
Figure 2.4: OpenCL Platform Model [22].
2.3.2 Execution Model
An OpenCL program executes over an index space and comprises two main components: host code running on the host device and kernel code running on each OpenCL device. Kernel instances are called work-items and are further grouped into work-groups. Each work-item has a unique identifier in the global index space and in the local index space (local to each work-group). Index spaces are called NDRanges and can have 1, 2 or 3 dimensions; thus, the local and global indices are 1-, 2- or 3-dimensional vectors. Figure 2.5 depicts an example of this organization for 2 dimensions. For GPU devices, the best performance should be attained when the work-group size is an integer multiple of the warp size (NVIDIA) or the wavefront size (AMD), because this is the minimum execution granularity supported. Failing to meet this criterion leaves idle hardware threads in the last warp or wavefront of each work-group.
To support different devices with different thread management systems, OpenCL employs a relaxed
synchronization and memory consistency model. This way, execution of work-items is not guaranteed
to follow any specific order. Nevertheless, explicit work-group barrier instructions can be placed in the
3This structure (and naming) closely resembles the one for AMD devices, presented in section 2.2.1.
Figure 2.5: Partitioning of work-items into work-groups.
kernel code to ensure execution synchronization between work-items of the same work-group. Synchronization of work-items belonging to different work-groups is not possible within the same kernel launch, a behaviour depicted in Figure 2.6. Memory consistency details are explored in Section 2.3.3.
Figure 2.6: Work-item synchronization: a work-group barrier synchronizes the work-items of one work-group (with memory consistency guaranteed inside each work-group), whereas global synchronization and memory consistency across work-groups are only guaranteed between kernel launches.
Another important concept is the OpenCL Context, which includes a collection of OpenCL Devices, a set of kernels, a set of Programs (source and compiled binaries that implement the kernels) and a set of Memory Objects. Associated with a Context are one or more Command Queues, via which the host enqueues execution, memory and synchronization commands to the OpenCL Devices. Each queue may be set as in-order or out-of-order, which determines whether commands must execute in the order in which they were enqueued.
2.3.3 Memory Model
The OpenCL standard defines four memory region types, each having different rules for access and
allocation:
(i) Global Memory: This memory region is accessible by all work-items for read/write operations. Furthermore, the OpenCL Host has read/write access and is responsible for dynamic memory allocation. This memory may or may not be cached, depending on the target architecture. AMD SI GPUs and newer NVIDIA devices, for example, have global memory caches accessible by each CU. Global memory read/write consistency between work-items of the same work-group is only guaranteed after they encounter a global work-group barrier. Conversely, there is no guarantee of memory consistency across different work-groups during the execution of a kernel. This behavior is depicted in Figure 2.6.
(ii) Constant Memory: Memory accessible by all work-items for read operations, remaining constant during the execution of a kernel. Like Global Memory, the Host has read/write access and is responsible for (dynamic) memory allocation. Constant memory is usually cacheable (e.g., in the Kepler architecture it is implemented as a configurable fraction of the L1 cache) and typically has a lower average access latency than Global Memory.
(iii) Local Memory: This memory region is shared by work-items of the same work-group for read/write
operations. Allocation can be done either statically by a kernel or dynamically by the Host (although
the Host cannot access this memory region). It is usually implemented as dedicated memory in
each CU, but in some devices it can also be mapped into Global Memory. In AMD SI GPUs, this memory is mapped into the LDS (see Figure 2.2), whereas in Nvidia's Kepler architecture it is mapped
into the Shared Memory (see Figure 2.3). Local memory is only consistent between work-items of
the same work-group after they encounter a local work-group barrier, as depicted in Figure 2.6.
(iv) Private Memory: Memory region private to each work-item, for read/write access. Neither the Host
nor other work-items can access this memory. It must be statically allocated in the kernel and is
usually implemented as registers in each CU.
2.3.4 Programming Model
The OpenCL standard supports two programming models: Data Parallel and Task Parallel. In the Data Parallel programming model, parallelism is exploited by executing the same set of operations in parallel over a large collection of data. Considering computation over data in an array, each work-item executes an instance of the kernel on one array index (strictly data-parallel model) or on several (relaxed data-parallel model). The hierarchical partitioning of work-items into work-groups can be defined explicitly by the programmer or implicitly by the OpenCL implementation.
Conversely, in the Task Parallel programming model, a single instance of the kernel is executed,
where parallelism can be extracted by using vector types supported by the device or by enqueueing
multiple tasks (different kernels) to the Device. Intel SSE/AVX/AVX2 vector instructions, for example,
can be inferred by writing operations with OpenCL vector types (e.g. float4, int4).
2.3.5 OpenCL Runtime Parametrization
To account for the existing heterogeneity, the OpenCL Host can query the underlying platform through
the OpenCL library for the available devices and their specific characteristics. As an example, the pre-
ferred elementary work-group size of each device can be queried, typically returning the warp-size (32)
for Nvidia GPUs, and the wavefront size (64) for AMD GPUs. For Intel OpenCL compatible CPUs, this
number is usually equal to (or higher than) 64 [24]. According to the results obtained from this device discovery process, different work-group partitioning schemes may be used for each device (e.g., number of work-items, work-group size, amount of data per work-item, etc.). Furthermore, to enable inter-platform portability, the OpenCL framework offers the possibility of compiling the developed kernels at runtime, allowing different compilation flags or kernel versions to be chosen according to the target platform.
2.4 Load Balancing Techniques
When considering the trade-offs between the multi-core CPU and GPU architectures, it makes sense to attempt a simultaneous exploitation of these computational platforms, by scheduling the workload to the device best suited for each particular task. Figure 2.7 depicts an example network of heterogeneous computing nodes, each comprised of multiple CPU cores and one or more specialized accelerators. In particular, these accelerators can be GPUs with very different compute capabilities, or even other types of hardware platforms (e.g. FPGAs, DSPs). In such a heterogeneous environment, HPC applications
Figure 2.7: Example of a heterogeneous network composed of several compute nodes, each comprised of multiple CPU cores and one or more specialized accelerators (e.g. GPUs, FPGAs).
frequently call for load balancing mechanisms to distribute the workload among the available processing nodes. A simple and insightful way of posing a typical load balancing problem is the following: consider a cluster of p processing nodes, and let t_i(d_i^k) be the time taken by node i to compute over the assigned data d_i^k at iteration k, where i ∈ {0, ..., p − 1}. The objective is that at some iteration k = b, all devices take the same time to compute the assigned load, i.e., t_i(d_i^b) = t_j(d_j^b) for every pair of nodes {i, j}. Specialized algorithms may take into account other performance metrics, such as consumed power [29] or inter-node communication latency [28]. Furthermore, while some publications aim to present generic load balancing methods, others focus on offering a solution for specific applications or scientific fields.
Load balancing algorithms found in the literature can typically be classified according to some fundamental characteristics [12]. First of all, the load balancing solution can be either Static or Dynamic. Static [28] implementations evaluate the characteristics of the application and the target hardware platform (either at compile-time or run-time) and derive the workload distribution from these data. For example, in [28] the authors introduce an algorithm to find a subset of computing nodes in a complex network that form an optimal virtual ring network, classifying candidate nodes by considering the processing capabilities of each one and the bandwidth of the respective communication links. Conversely, dynamic [10, 42] load balancing solutions take into account one or more performance metrics (e.g. time, power, accuracy) measured at run-time and dynamically modify the workload distribution to best fit the heterogeneous platform. For example, in [42], a dynamic load balancing algorithm is devised for the Adaptive Fast Multipole Method (AFMM), a solver for n-body problems (e.g. colliding galaxies, fluid dynamics). In order to balance the load in a cluster composed of 10 CPUs and 4 GPUs, an adaptive decomposition of the particle space is employed and modified dynamically according to a performance model that predicts the performance of future iterations using previous execution time measurements.
Secondly, load balancing algorithms can be either Centralized [9, 10, 12] or Decentralized [28, 53]. The former concentrate load balancing decisions in one monitoring node that schedules the work among the cluster, whereas the latter rely on local decisions made on each computing node (possibly using information from neighbour nodes) to distribute the workload among them. Furthermore, centralized load balancing algorithms can be further classified as either Task-Queue [10] or Predicting-The-Future [9, 12, 42]. Task-queue algorithms rely on partitioning the workload into several smaller tasks, which are continually fetched by the computing nodes. Although they are a relatively simple solution to implement, a high-speed communication link is required between the node managing the task-queue and every other computing node, since tasks usually have to be fetched frequently (to ensure a fine-grained balancing). Conversely, predicting-the-future approaches schedule the work based on performance measurements from past iterations. If the balancing solution is well implemented (and the target algorithm allows it), these approaches can converge to a stabilized workload distribution and cease to require intensive inter-node communication.
Considering the importance of load balancing methods for scheduling the workload among heterogeneous devices, two balancing solutions were employed in the parallelization approach devised in this dissertation. The first is a task-queue algorithm with a distributed balancing decision, whereas the second is a centralized predicting-the-future dynamic load balancing approach. Details about these two algorithms will be presented in Chapter 4 and Chapter 5.
2.5 Summary
In this chapter, an overview of both state-of-the-art CPU and GPU hardware was presented, and the architectural differences between the two platform families were discussed. Next, an overview of the OpenCL programming framework was introduced, highlighting the structure of the framework and the opportunities it offers to exploit a wide range of accelerators. The advantages of exploiting heterogeneous platforms comprised of CPU and GPU devices, together with the wide availability of these computational resources among scientific research groups, led to the choice of targeting this type of system. In this respect, a brief review of the literature on load balancing solutions was presented, covering typical approaches for efficiently scheduling the workload in a heterogeneous computing environment. Further details about the particular load balancing algorithms employed in this dissertation will be presented in Chapter 4 and Chapter 5.
CHAPTER 3
PERTURBATIVE MONTE CARLO QM/MM
Contents
3.1 Algorithm Description
3.2 Computational Complexity Analysis
3.3 Data Dependencies
3.4 Related Work
3.5 Summary
The Perturbative Monte Carlo QM/MM algorithm is a molecular simulation procedure designed for mixed QM/MM simulations. These simulations usually consider a circumscribed region of interest (often referred to as the active site) and an immersive environment. The algorithm takes a QM/MM system as input and outputs other chemically viable configurations of the same system, sampled with the Metropolis Monte Carlo rule [32]. As introduced earlier in this dissertation, the Metropolis MC method samples the system in the ensemble space, rather than following a time coordinate. With this method, a sequence of random configurations is obtained on the basis of Maxwell-Boltzmann statistics, by performing random movements at each step and by evaluating the corresponding change of the system energy. The underlying methods for calculating this energy can be traditional MM, QM or mixed QM/MM methods. MM approaches represent atoms and molecules through ball-and-spring models, with heavily parameterized functions to describe their interactions. Alternatively, QM approaches explicitly simulate the electrons, at the cost of a much higher computational burden. An alternative solution (which is employed in the algorithm herein studied) consists of a mixed QM/MM approach, which combines the strengths of each method. In this case, a small active region is simulated with QM, while the remaining environment is represented by classical MM. Achieving a comprehensive understanding of the algorithm structure represents a fundamental step to devise the best parallelization approach. Accordingly, a brief characterization of the QM/MM simulations under study, together with an overview of the PMC method, is presented in this chapter. Next, a computational complexity analysis and the strategy that will be applied for the algorithm parallelization are introduced. Finally, the related work on accelerating molecular simulation algorithms is discussed.
3.1 Algorithm Description
For the purpose of describing the PMC method, a chemical solution composed of a solute (region of interest) and a solvent (environment) will herein be considered as an example. Accordingly, Figure 3.1 depicts a schematic of such a system, comprised of a single solute molecule (molecule C), which is treated at the QM level, and two solvent molecules (molecules A and B), treated at the MM level. By applying a Metropolis MC step, one of these molecules is randomly picked, translated and rotated to generate a new structure. This MC step is then either accepted or rejected, according to the resulting energy change. MC steps are accepted if the energy of the obtained configuration is lower than that of the reference configuration (the last accepted configuration), or accepted with probability1 e^(−∆E/(k_B T)) if the energy of the system has risen.
As depicted in Figure 3.1, the energy change is computed by considering two types of interactions with the moved molecule (e.g. molecule A), resulting in either QM/MM energy terms or pure MM terms. The QM/MM terms account for the interaction with the QM solute (molecule C), whereas the MM energy terms account for the interaction with every other solvent molecule (in this case, just molecule B). Furthermore, for both levels of theory (QM/MM or pure MM), Coulomb and van der Waals (vdW) contributions have to be considered.
1Boltzmann distribution, where k_B stands for the Boltzmann constant and T for the temperature.
Figure 3.1: A system composed of one QM molecule (C) and two MM solvent molecules (A and B). For each MC step, the difference in energy between the moved molecule (A) and every other molecule has to be computed, but at different levels of theory.
Figure 3.2: Perturbative Monte Carlo QM/MM with focus on the simulation bottleneck (PMC Cycle, right): each of the K_PMC iterations runs a QM Update and a PMC Cycle of K_Cycle Monte Carlo steps, each step computing the Coulomb and vdW energy terms at the MM and QM/MM levels, testing for acceptance, updating the reference and periodically outputting results. Arrows represent data dependencies.
Furthermore, Figure 3.2 illustrates the dataflow of the target PMC method. In each PMC Cycle
(right), KCycle Monte Carlo steps of the MM subsystem are executed, while keeping the QM region
static. In another process, the electronic density of the QM region is updated (QM Update) by using
MOLPRO[55], and the result is subsequently used in the next PMC Cycle. As described earlier, the
system energy variation has to be computed at each MC step (henceforth referred to as a PMC Cycle
step), given by the expression:
∆E = ∆E_MM^C + ∆E_MM^vdW + ∆E_QM/MM^vdW + ∆E_QM/MM^C,nuclei + ∆E_QM/MM^C,grid   (3.1)
where each partial ∆E term corresponds to an energy contribution computed in a particular PMC Cycle
procedure (see Figure 3.2). As previously introduced, each PMC Cycle step consists of selecting,
translating and rotating a random MM molecule, computing ∆E (see Equation 3.1), and checking the
current QM/MM system configuration for acceptance. In order to store the obtained results, the current
QM/MM configuration is written to an output file every F_output iterations (see Table 3.1). Moreover,
despite being a good example for illustrating the algorithm, Figure 3.1 only depicts a very small system.
In contrast, a more general QM/MM run will have a much higher number of molecules, and is characterized
Table 3.1: QM/MM run characterization, together with the typical parameter range for the benchmarks considered in this work. For the case of homogeneous solvents, the Z_MM^(i) parameter (concerning molecule i) will be the same for every MM molecule.

Input QM/MM System
  Parameter    Description                                                     Typical Range
  N_QM         Number of QM grid points                                        [10^5, 10^7]
  N_MM         Number of MM molecules                                          10^3
  Z_QM         Number of atoms in the QM region                                [10, 10^2]
  Z_MM^(i)     Number of atoms per MM molecule                                 [1, 10]
  A_MM         Number of MM atoms                                              [10^3, 10^4]

Run Parameters
  K_PMC        Number of PMC iterations                                        [10^4, 10^7]
  K_Cycle      Number of PMC cycle steps (per PMC iteration)                   [10, 10^3]
  F_output     Output write frequency (procedure Output Result, Figure 3.2)    [10^3, 10^4]
by the parameters introduced in Table 3.1.
The Coulomb QM/MM energy computation is of particular interest, since it is the most computationally
intensive calculation in each PMC Cycle. This energy contribution is accounted for by two distinct terms,
∆E_QM/MM^C,nuclei and ∆E_QM/MM^C,grid. The former accounts for the interaction with the atoms of the QM molecule,
represented by classical nuclei-centred charges, whereas the latter accounts for the interaction with the
QM electronic density, represented by a grid of point charges (henceforth referred to as grid). Between
the two, the ∆E_QM/MM^C,grid term is considerably more computationally intensive (see Section 3.2 for more details)
and corresponds to a discretization of the integral shown in Equation 3.2, where Z_MM and N_QM follow
the definitions given in Table 3.1, ρ(.) is the electronic density function, q the charge and r the distance
between the changed molecule and each grid point.
∆E_QM/MM^C,grid = Σ_j^(Z_MM) ∫ ρ(r) q_j / r_{i,j} dr   --grid-->   Σ_j^(Z_MM) Σ_i^(N_QM) q_i q_j / r_{i,j}   (3.2)
The pseudo-code for the Coulomb Grid QM/MM energy computation (∆E_QM/MM^C,grid) is presented in Algo-
rithm 1. As shown, the Coulomb potential is computed for each {atom, grid point} pair (considering the
atoms of the displaced molecule). Furthermore, since periodic QM/MM systems (defined by a repeat-
able simulation box) are herein considered, the spatial range of the considered electrostatic interactions
(i.e., Coulomb, vdW) has to be limited by a cutoff distance (r_c). Accordingly, shifted potentials
(V_shift) [16] are used in the ∆E_QM/MM^C,grid interaction terms
V_shift = { 1/r − 1/r_c + (1/r_c²)(r − r_c)   if r < r_c
          { 0                                 if r ≥ r_c      (3.3)
affecting each term differently, depending on the distance (r) between each {atom, grid point} pair,
and completely disregarding (setting to 0) the interaction whenever r ≥ r_c. The usage of shifted potentials
can be observed in Algorithm 1, resulting in four possible space regions, depending on the distance
between the considered grid point and both the old and the new set of coordinates of each atom of the
displaced molecule. Hence, four slightly different energy expressions (resulting from the application of
V_shift) may be computed. As discussed further in this dissertation, the procedure presented in Algorithm 1
will be one of the main targets of parallelization.
Algorithm 1 Coulomb Grid QM/MM energy (∆E_QM/MM^C,grid). See Table 3.1 for parameter definitions.

Define: atom := {position = {x, y, z}, chemical params = {σ, ε, q}}
Init: Energy = 0.0
Init: r_c → Coulomb cutoff (run parameter)
 1: for each atom i in changed molecule do            [ Z_MM^(chmol) cycles ]
 2:   for each point j in charge grid do              [ N_QM cycles ]
 3:     r_old = distance(i, j) in reference system
 4:     r_new = distance(i, j) in new system
 5:     qs = −q_i × q_j
 6:     if r_new < r_c and r_old < r_c then
 7:       Energy += qs × (1/r_new − 1/r_old + (1/r_c²)(r_new − r_old))
 8:     else if r_new < r_c and r_old ≥ r_c then
 9:       Energy += qs × (1/r_new − 1/r_c + (1/r_c²)(r_new − r_c))
10:     else if r_old < r_c then
11:       Energy −= qs × (1/r_old − 1/r_c + (1/r_c²)(r_old − r_c))
12:     end if
13:   end for
14: end for
3.2 Computational Complexity Analysis
The computational complexity of the PMC QM/MM method depends on the complexity of the program
procedures that comprise the PMC Cycle (see Figure 3.2). The Monte Carlo Step procedure, which
consists of rotating and translating a random molecule (henceforth referred to as chmol), has a
complexity proportional to the size of that MM molecule

O(Monte Carlo Step) = Z_MM^(chmol)   (3.4)
usually having a very light execution time footprint. On the other hand, the complexity of the Coulomb
Grid QM/MM procedure (Algorithm 1) is proportional to the product of the size of chmol and the number
of grid points

O(Coulomb Grid QM/MM) = Z_MM^(chmol) × N_QM   (3.5)
which will usually be the most time-consuming procedure, since N_QM is typically a large number. The
other two Coulomb computations have identical algorithm structures, although the involved data differs.
Coulomb Nuclei QM/MM uses nuclei-centred point charges instead of the electronic grid, and thus its
complexity is given by:

O(Coulomb Nuclei QM/MM) = Z_MM^(chmol) × Z_QM   (3.6)
yielding a much lower complexity in comparison with Coulomb Grid QM/MM. On the other hand, Coulomb
MM computes the interaction between each atom of chmol and each atom of every other MM molecule.
Hence, its complexity is given by:

O(Coulomb MM) = Z_MM^(chmol) × Σ_i^(N_MM) Z_MM^(i)   (3.7)
which simplifies to (Z_MM^(chmol))² × N_MM for the case of homogeneous solvents. The total
number of MM atoms (A_MM) may also be used in this text as the complexity variable, considering that
A_MM = Σ_i^(N_MM) Z_MM^(i), for either homogeneous or heterogeneous solvents.
The vdW energy calculations have a completely different energy expression, as presented in Algorithm 2,
which shows the pseudo-code for vdW MM. Likewise, the vdW QM/MM procedure shares an identical
structure, although it loops over the QM nuclei-centred charges instead of the MM atoms. Similarly
to the Coulomb computations, the vdW procedures have a nested for-loop structure with four cutoff
branches. Thus, the complexity of the vdW procedures is identical to that of their Coulomb counterparts,
yielding the following expressions:

O(vdW MM) = Z_MM^(chmol) × Σ_i^(N_MM) Z_MM^(i)   (3.8)

O(vdW QM/MM) = Z_MM^(chmol) × Z_QM   (3.9)
Finally, each PMC Cycle step terminates with an update of the current system reference and output
writing. The reference update complexity is proportional to the size of the changed molecule, whereas
the output saving is proportional to the total number of MM atoms over the writing frequency:
O(Update Reference) = Z_MM^(chmol)   (3.10)

O(Output XYZ) = (Σ_i^(N_MM) Z_MM^(i)) / F_output   (3.11)
Having in mind the typical magnitude of the QM/MM parameters (see Table 3.1), one can deduce
that the most computationally intensive procedure is the Coulomb Grid QM/MM. This will be taken into
consideration when parallelizing the PMC program procedures. By accounting for all the PMC Cycle
procedures, the complexity of one PMC Cycle step results in:

O(PMC Cycle) = Z_MM^(chmol) × (N_QM + Z_QM + A_MM) + A_MM / F_output   (3.12)
by recalling that A_MM = Σ_i^(N_MM) Z_MM^(i). Considering the typical ranges for these parameters (see Ta-
ble 3.1), one can observe that the leading term will be Z_MM^(chmol) × N_QM. In particular, the N_QM
parameter will have the heaviest footprint on the resulting complexity.
3.3 Data Dependencies
The PMC Cycle operates over three main data structures, which are depicted in Figure 3.3. Firstly,
the changed molecule (chmol), which is composed of Z_MM^(chmol) atoms, each represented by three-
dimensional Cartesian coordinates x, y, z and chemical constants σ, ε, q. Secondly, the QM grid, which is
composed of N_QM point charges, each also represented by Cartesian coordinates and a charge² (q).
Finally, the MM lattice, which comprises all MM molecules, including the chmol data before the MC step
takes place, and the QM molecule represented with classical MM nuclei (Z_QM atoms).
²In this case, the charge is not constant, because it is modified (alongside the coordinates) by the QM Update process.
Algorithm 2 vdW MM energy (∆E_MM^vdW). See Table 3.1 for parameter definitions.

Define: atom := {position = {x, y, z}, chemical params = {σ, ε, q}}
Init: Energy = 0.0
Init: r_c → van der Waals cutoff (run parameter)
 1: for each atom i in changed molecule do            [ Z_MM^(chmol) cycles ]
 2:   for each atom j in every other MM molecule do   [ Σ_j^(N_MM) Z_MM^(j) = A_MM cycles ]
 3:     r_old = distance(i, j) in reference system
 4:     r_new = distance(i, j) in new system
 5:     if r_new < r_c and r_old < r_c then
 6:       Energy += √(ε_i × ε_j) × ((σ_i σ_j / r_new²)⁶ − (σ_i σ_j / r_old²)⁶ − (σ_i σ_j / r_new²)³ + (σ_i σ_j / r_old²)³)
 7:     else if r_new < r_c and r_old ≥ r_c then
 8:       Energy += √(ε_i × ε_j) × ((σ_i σ_j / r_new²)⁶ − (σ_i σ_j / r_new²)³)
 9:     else if r_old < r_c then
10:       Energy −= √(ε_i × ε_j) × ((σ_i σ_j / r_old²)⁶ − (σ_i σ_j / r_old²)³)
11:     end if
12:   end for
13: end for
Figure 3.3: Main data structures used in the PMC Cycle: chmol (Z_MM^(chmol) × {x, y, z, σ, ε, q}), grid (N_QM × {x, y, z, q}) and lattice ((N_MM × Z_MM + Z_QM) × {x, y, z, σ, ε, q}). Refer to Table 3.1 for parameter definitions.
Figure 3.4 shows the data dependencies of each process in the PMC Cycle. In particular, the data
structure corresponding to the changed molecule (chmol) is written by the Monte Carlo Step and
subsequently read by all the energy calculation procedures, which compute their respective ∆E energy
terms to be processed by the Decide & Update procedure. Then, if the step under consideration is
accepted, the lattice corresponding to the MM Region (see Figure 3.1) is updated with the tested chmol
configuration, and a new Monte Carlo Step may take place. Unlike the other data structures, the grid
corresponding to the QM Region (see Figure 3.1) is not modified within the PMC Cycle. Instead, it is up-
dated by the QM Update process. Hence, considering the described data dependencies within the PMC
Cycle, it is observed that the energy contribution procedures can be executed in parallel with respect
to each other. Furthermore, each energy calculation is itself amenable to parallelism, as it can be
mapped to a parallel reduction structure. For the particular case of the Coulomb Grid QM/MM
procedure, this can be verified by inspecting Algorithm 1, although the other energy calculations share
the same structure, apart from the energy expression and the involved data (e.g., see Algorithm 2).
Having this in mind, the PMC Cycle is the main target of study and parallelization in this work. To
this end, several OpenCL kernels were devised to extract the available parallelism in the PMC Cycle
procedures, as well as a capable Host-side management framework to schedule the work among the
Figure 3.4: Data dependencies within the PMC Cycle. The vdW QM/MM and Coulomb Nuclei QM/MM processes only read the atoms that are part of the QM molecule, not the whole lattice.
available computational resources. The internal implementation of the QM Update will be kept mostly
unchanged3, apart from simple add-ons to accelerate inter-process communication. Nevertheless, a
scalable multiple Markov chain solution, which exploits parallelism in the MC state-space sampling, was
designed to accelerate the QM Update procedure. Chapters 4 and 5 discuss the devised solution in
detail.
3.4 Related Work
Due to the computational complexity of molecular simulation procedures, there has been substantial
research work on their acceleration. The literature describing this work can be grouped by: i) the nature
of the employed sampling, ii) the type of theory used for the energy calculations and iii) the chemical
application for which they have been tuned. The employed sampling is usually performed in time (MD)
or in state-space (MC) and the energy interactions may consider pure QM, pure MM and mixed QM/MM
terms. Furthermore, for the same state-space sampling strategy, several variants may be considered.
For the case of MC sampling, this includes (among other possible approaches) the Diffusion Monte Carlo
(DMC) [34], the Variational Monte Carlo (VMC) and the PMC [51]. Finally, the application for which the
algorithm has been tuned may vary greatly, and this is the main reason why the performance
gains attained in the parallelization of the algorithms in this field can seldom be compared to each other.
MD is a popular approach to studying a wide range of dynamical properties, and it has led to several
acceleration works, dating from the early days of GPGPU [17, 48] to more recent publications [31, 38, 44].
On the other hand, methods based on MC sampling allow simulating systems with longer timescales,
and several works have also accelerated these algorithms by following GPGPU approaches [2, 3, 13,
23, 30, 52]. Our work falls into the latter category (MC) and therefore we shall present a more detailed
review of those works.
The work in [2] presents a GPGPU solution for Quantum Monte Carlo (QMC), achieving up to 30×
speed-up in individual kernels and up to 6× speed-up in the overall execution. The QMC variety that is
considered by such research is based on DMC, unlike the PMC approach followed in our work. They
³The MOLPRO program suite is a closed-source commercial tool and performs extremely complex calculations in the QM Update procedure. Besides not having the main code available (aside from user scripts), it is not the bottleneck of the PMC QM/MM, and thus optimizing it is out of the scope of this work.
employ a scheme for simultaneous state-space exploration (each chain being called a walker), similar
to the multiple Markov chain approach that is herein adopted. However, they emphasize exploiting a
high amount of parallelism at the walker level (up to 16 simultaneous walker evaluations on the GPU),
whereas we focus on exploring the finer-grain level of parallelism within each chain (which, in our case,
is heavy enough to keep the GPU busy), and manage chain-level parallelism with fewer chains per GPU.
We took this approach since spawning a very large number of chains on the same GPU would be
unfeasible for the case of the PMC method, since each chain requires computing not only the MC
step trials (in this case, the PMC Cycle procedures), but also the intrinsically serial QM Update process.
In [13], the authors discuss a parallel GPGPU approach to continuum QMC, based on DMC. They target
Nvidia GPUs by using the CUDA framework, and MPI to schedule the work among computational
clusters, exploiting walker-level and data-level parallelism, and achieving full-application speed-ups
from 10× to 15× with respect to a quad-core Xeon CPU implementation. However, unlike the
work herein described, they do not target QM/MM systems, focusing only on QM applications.
The work described in [52] uses MC sampling (based on Variational Monte Carlo) and targets
QM/MM systems, by exploiting computational clusters composed of heterogeneous nodes. Accord-
ingly, since their performance bottleneck is on the calculation of the electrostatic potential, they use
GPUs to handle the bottleneck code and CPUs for the remaining procedures, obtaining a speed-up of
up to 23.6× versus a single-core CPU. The adopted GPGPU framework is CUDA, and an MPI solution
is shown to scale up to 4 CPU cores. They do not report any explicit load balancing solution, nor do they
target the simultaneous exploitation of heterogeneous GPU platforms, contrary to the work herein presented.
In [3], the authors describe a CUDA GPGPU implementation for many-particle simulations using MC
sampling. They partition the particle set into several cells and apply many MC steps in parallel, which
are known not to interfere with each other. They do not target QM/MM systems. Instead, tests are
performed for a "hard disk" system (two-dimensional particles which cannot overlap), and the considered
particle interactions are the physical collisions. Unlike physical collisions, the electrostatic potentials
considered in our work have a much longer range, and as such the energy terms computed at each MC
step depend on a much larger number of neighbouring molecules (the potential cutoffs are about half of
the simulation box). Therefore, such a scheme would not be effective to solve the problem that is herein
considered, as most MC steps would interfere with each other. The work presented in [23] also describes
a parallel approach to particle MC simulations using CUDA, without any emphasis on QM/MM systems.
Finally, the work in [38] targets QM/MM simulations, although time sampling (MD) is used instead,
and a special focus is given to accelerating the QM grid generation, achieving up to 30× speed-up. This
contrasts with what happens in the PMC, where the bottleneck is found in the QM/MM electrostatics (the
PMC Cycle), which is significantly accelerated by our implementation.
Before concluding, it is worth recalling that direct performance comparisons are difficult to make
in this field, and very few authors attempt them in the literature. Furthermore, very few have considered the
usage of heterogeneous architectures for hybrid QM/MM simulations, whilst using MC sampling. Our
solution efficiently takes advantage of the hybrid nature of QM/MM simulations and the MC state-space
exploration, unlike typical pure QM or MM approaches.
Most existing works adopted CUDA as the programming framework, being therefore constrained to Nvidia
GPUs. To circumvent this limitation, other frameworks have been developed to ease the programming
of non-conventional architectures, such as StarPU [5] and OpenCL [22]. Due to its simpler means to
orchestrate multiple devices in a heterogeneous environment and to write code that is portable between
different architectures, the latter was used in this work. Moreover, by allowing an easy extension with the
MPI framework, the proposed approach leaves open the possibility of exploiting further performance
scalability at the chain level, since the most challenging fine-grained part, the parallelization of the
PMC Cycle, has already been overcome.
3.5 Summary
In this chapter, a brief characterization of the QM/MM simulations under study, together with an
overview of the PMC method, was presented. Then, a computational complexity analysis of the PMC Cy-
cle procedures was conducted, revealing the computational bottlenecks and concluding that the Coulomb
Grid QM/MM procedure is the most computationally intensive step of the PMC Cycle. In particular, the
dominating term was shown to be the number of QM grid points (N_QM). Next, a description of the
data dependencies present in the PMC Cycle was presented, laying out the basis for the paralleliza-
tion strategy presented in the following chapters. Finally, the related work on accelerating molecular
simulation algorithms was discussed and commented on. It was concluded that, despite the vast
diversity of research in this particular field of application, this dissertation still provides novel
contributions. In particular, heterogeneous architectures have seldom been considered, and the usage
of the multi-platform, multi-paradigm OpenCL framework, as well as the targeting of the particular PMC
QM/MM method, are among the novel contributions of the work herein presented.
CHAPTER 4

PARALLEL HETEROGENEOUS SOLUTION

Contents
4.1 Original PMC QM/MM
4.2 Exploiting Markov Chain Parallelism
4.3 Parallelization Strategy
4.4 Data Structure Optimizations
4.5 Summary
The objective of this work is to accelerate the execution of the PMC QM/MM algorithm by exploiting
heterogeneous platforms composed of a multi-core CPU and one or more OpenCL accelerators (e.g.,
GPUs). In this chapter, a top-level description of the devised parallel solution is introduced. Firstly, the
original PMC QM/MM approach (developed at the Free Floater Research Group) is briefly described.
Then, an introduction to applying Markov chain theory to MC simulations is presented, focusing
on the particular case of the PMC QM/MM simulation method. After this, the overall structure of the
parallelization strategy is laid out, discussing details about the developed OpenCL Host program and
the work-flow of the complete application. Then, a coarse-level load balancing solution to schedule the
Markov chain workload among heterogeneous devices is described. Finally, a few preliminary data-
structure optimizations are discussed. A detailed description of the developed OpenCL kernels, as well
as a second load balancing algorithm for scheduling finer-grained workloads, is presented in Chapter 5.
4.1 Original PMC QM/MM
The starting point for the parallelization study developed in this dissertation was the original PMC
QM/MM algorithm implementation, provided by the Free Floater Research Group - Computational Chem-
istry and Biochemistry, Institut fur Physikalische Chemie, Georg-August-Universitat Gottingen. This orig-
inal approach was designed to run on a single-core CPU, executing two interleaving UNIX processes:
the PMC Cycle and the QM Update, which communicated via a file (hard-disk I/O) between PMC itera-
tions. The PMC Cycle was developed at the Free Floater Research Group and its complete C++ source
was made available for this work. The QM Update comprises a few FORTRAN user scripts (also
developed at the Free Floater Research Group) which call MOLPRO routines. In contrast with the other
program parts, the MOLPRO program suite is a closed-source commercial tool, which performs
extremely complex calculations in the QM Update procedure. Since the code is not available for opti-
mization and since this procedure is not the bottleneck of the original PMC QM/MM (as will be shown
in Figure 6.2), optimizing it was deemed to be out of the scope of this work. Nevertheless, a method
for executing several instances of the QM Update in parallel will be introduced in this dissertation, by
exploring multiple Markov chain parallelism, a topic discussed in the following section.
4.2 Exploiting Markov Chain Parallelism
In the context of the Metropolis MC sampling method [32], a sequence of accepted steps is called a
Markov Chain [19, 20]. For the particular case of the PMC QM/MM algorithm, a Markov Chain represents
a sequence of accepted QM/MM system configurations, which are generated by independent instances
of the PMC (PMC Cycle + QM Update). As depicted in Figure 4.1, several independent MC state-space
exploration chains may coexist, each generating an independent sampling of the conformal space of the
target QM/MM system.
The exploitation of multiple Markov chains in general purpose MC methods has been addressed in
several works [6, 45], and even in the context of a CPU-GPU environment [57]. In the next subsections,
details for the particular case of exploring Markov chain parallelism in the PMC QM/MM method are pre-
Figure 4.1: Independent MC state-space exploration chains (illustrative example for 2 chains), each generating an independent sampling of the conformational space of the target QM/MM system.
sented. To keep the devised approach as general as possible, and considering the vast diversity of
computational platforms that are commonly available today, two distinct QM/MM simulation scenarios
deserve particular attention: running fewer Markov chains than the number of available OpenCL accel-
erators, and the opposite case. In particular, the former is typically found in many-node computational
clusters, since these hardware platforms may have more computing nodes than the number of
independent Markov chains one wishes to spawn in order to achieve the desired statistical properties
of the MC sampling. To address this case, specially tailored load balancing approaches are required,
since data from the same Markov chain exploration context has to be shared between several (possibly
heterogeneous) nodes. The approach for balancing the work of a single Markov chain among several
devices is presented in Chapter 5.
4.2.1 Multiple Markov Chain Parallelism
As introduced earlier, several MC state-space instances can be sampled by running several Markov
chains in parallel, thus allowing the simultaneous execution of the respective PMC Cycles. Furthermore,
this technique also allows executing the respective QM Update processes for the several chains in par-
allel. Since the PMC Cycle is the bottleneck of the PMC QM/MM method (as will be shown in Figure 6.2)
and provides several opportunities to extract task-level and data-level parallelism (see Section 3.3), it
will be executed on OpenCL accelerators. On the other hand, since the QM Update is an intrinsically
serial procedure, it will be executed by spawning independent Markov chain instances on multiple CPU
cores. The MC state-space sampling layout corresponding to this approach is shown in Figure 4.2 (left),
together with the corresponding execution time-flow (right). Although the depicted example corresponds
to three independent chains, this number can scale with the available computational resources, as more
OpenCL accelerators and CPU cores are added to a given hardware configuration. It is important to note
that, although the PMC Cycle was the computational bottleneck in the original implementation, the high
Figure 4.2: MC state-space alongside the execution timeline for three Markov chains.
performance speed-ups attained in the acceleration of this procedure considerably reduced its execu-
tion time (more details in Chapter 6). Therefore, depending on the considered acceleration platform, the
ratio between the execution times of the QM Update (t(QMupdate)) and the PMC Cycle (t(PMCcycle))
might vary considerably. Having this in mind, and by observing Figure 4.2, one can conclude that the
maximum number of independent Markov chains that can be spawned depends on the
t(QMupdate)/t(PMCcycle) ratio in the following manner:

max_chains = #Accelerators × t(QMupdate)/t(PMCcycle) + 1   (4.1)

where the ratio t(QMupdate)/t(PMCcycle) represents the number of (accelerated) PMC Cycles required
to occupy the OpenCL accelerator while the CPU is handling the QM Update (for the sake of keeping the
example in Figure 4.2 as simple as possible, a ratio of 2 was assumed, although larger ratios are usually
observed in real datasets - see Chapter 6). Moreover, max_chains will also be limited by the number of
CPU cores available to run the QM Updates. Since the QM Update process relies heavily on disk I/O,
the performance of the Host CPU may start to degrade when a higher number of processes is spawned
(as shall be shown in Chapter 6).
The multiple Markov chain parallelism strategy presented in [57], and in some of the works dis-
cussed in Section 3.4, relies on a very high number of Markov chains to exploit parallelism in the many-
core GPU architecture. For the particular case of the approach introduced in [57], a GPU thread is
spawned to manage each Markov chain. This approach would be unfeasible for the case of the PMC
method, since each chain requires computing not only the MC step trials (in this case, the PMC Cycle
procedures), but also the intrinsically serial QM Update process. To tackle this limitation, the approach
herein presented focused instead on exploiting task and data-level parallelism in each PMC Cycle step
(as will be discussed in Chapter 5), as well as chain-level parallelism by scheduling the tasks associ-
ated with each Markov chain (PMC Cycle and QM Update) among multiple CPU cores and OpenCL
accelerators.
4.3 Parallelization Strategy

By considering the multiple Markov chain parallelism method introduced in Section 4.2.1 and the
PMC Cycle data dependency analysis presented in Section 3.3, three levels of parallelism can be ex-
tracted in the PMC QM/MM method: i) running several independent Markov chains (chain-level par-
allelism); ii) executing the PMC Cycle procedures in parallel with respect to each other (task-level par-
allelism); iii) executing the inner iterations of each procedure in parallel, for different sections of the
dataset (data-level parallelism). In this respect, Figure 4.3 depicts the exploitation of these levels of
parallelism in the PMC QM/MM method. As discussed in Section 4.2.1, the PMC Cycle will be executed
on OpenCL accelerators, whereas the QM Update will be executed by spawning independent Markov
chain instances on multiple CPU cores. To accomplish this approach, the devised parallel solution is
mainly composed of: i) a C++ Host-side CPU program (henceforth referred to as the Host-Program)
to manage the OpenCL devices and the QM Update processes; ii) a UNIX pipe interface, written in C,
to manage communications between the Host-Program and the QM Update procedures (replacing the
original file-based communication); iii) a set of OpenCL kernels to accelerate the PMC Cycle execution.
Figure 4.3: Simultaneous exploitation of chain-level, task-level and data-level parallelism in the PMC QM/MM method (illustrative example for 3 chains).
The described management approach was taken for several reasons. Firstly, a centralized Host-
Program approach was adopted: since the hardware setup targeted in this thesis is a single compute
node composed of multi-core CPUs and heterogeneous GPUs, the overhead of centralized management
is not a problem in this case. Although the presented approach could be scaled to a multi-node com-
puting environment (e.g., using MPI), this was not considered to be a priority in this dissertation, since a
single-node heterogeneous system already allows a fairly extensive study of the employed parallelization
and load balancing schemes. Secondly, the original file-based communication system was substituted by
UNIX pipes in order to: i) free the disk from I/O burden as much as possible (since the MOLPRO program
package used in the QM Update already uses the hard-drive intensively for temporary files); ii) provide
a faster communication medium (if sufficient memory is available, pipe inter-process communication
transfers are executed via main memory). For the QM Update side of the communications, a FORTRAN/C
binding was used, and all the pipe communications code was developed in C, due to an easier access
Figure 4.4: Multi-process/multi-threading structure of the designed parallel solution for the PMC method (right), alongside the original dual-process approach (left).
to system functions from this language. Finally, since it is the bottleneck of the PMC QM/MM method, particular focus was placed on accelerating the PMC Cycle procedure with OpenCL kernels. Being the main target of acceleration in this work, the PMC Cycle will be discussed in greater detail in Chapter 5. Likewise, in order to keep the description of the devised approach manageable, this chapter will focus on describing the top-level parallel approach, leaving a more detailed description of the finer-grained parallelism exploitation and load balancing to Chapter 5.
4.3.1 OpenCL Host-Side Management
Figure 4.4 presents the original dual-process PMC approach1, alongside the multi-process/multi-threading structure of the designed solution. In the latter approach, the PMC Host-Process is mainly composed of: i) a centralized thread to manage synchronization and balancing among all OpenCL devices (OCLManager, label 1); ii) a thread dedicated to managing the OpenCL command queue operations (OCLDevice, label 2) for each device; iii) a thread dedicated to each Markov chain (OCLChain, label 3), responsible for managing inter-process communication between the PMC Host-Process and the QM Update processes (label 4). To accomplish inter-thread synchronization, the mutex and condition variable primitives were used. Furthermore, inter-process synchronization and communication were accomplished via UNIX pipes, connecting each OCLChain thread to the corresponding QM Update process. This pipe mechanism was implemented to substitute the original file-based (disk I/O) communication system (Figure 4.4, left).
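This pipe-based exchange can be sketched as follows. This is a minimal, hypothetical C++ illustration using POSIX pipe()/fork(): the child process stands in for a QM Update (which, in the actual implementation, is launched via execlp() and runs MOLPRO through a FORTRAN/C binding), and the parent plays the role of an OCLChain thread sending the lattice and ∆E data and receiving the new grid.

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <string>

// Hypothetical sketch: round-trip one message through UNIX pipes between
// the Host-Program side and a forked stand-in for the QM Update process,
// replacing the original file-based (disk I/O) exchange.
std::string pipe_round_trip(const std::string& lattice_msg) {
    int to_qm[2], from_qm[2];
    if (pipe(to_qm) != 0 || pipe(from_qm) != 0) return "";
    pid_t pid = fork();
    if (pid == 0) {                        // child: stands in for the QM process
        close(to_qm[1]); close(from_qm[0]);
        char buf[256];
        ssize_t n = read(to_qm[0], buf, sizeof(buf));    // receive lattice + dE
        std::string reply = "grid:" + std::string(buf, n > 0 ? n : 0);
        write(from_qm[1], reply.data(), reply.size());   // send the new grid back
        _exit(0);
    }
    close(to_qm[0]); close(from_qm[1]);    // parent: the OCLChain thread side
    write(to_qm[1], lattice_msg.data(), lattice_msg.size());
    close(to_qm[1]);
    char buf[256];
    ssize_t n = read(from_qm[0], buf, sizeof(buf));
    close(from_qm[0]);
    waitpid(pid, nullptr, 0);
    return std::string(buf, n > 0 ? n : 0);
}
```

If sufficient memory is available, these transfers never touch the disk, which is precisely the property the pipe interface exploits.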
1Despite being the original PMC implementation, it is not used as the performance baseline in this dissertation, since it would not allow a representative assessment of the performance gains with respect to the devised solution. The performance baseline is defined in Chapter 6.
Figure 4.5 depicts the execution work-flow of the parallel PMC program, for the case of a single-
device single-process instance (in order to keep the example manageable). The program starts by
reading the input file (step 1, Figure 4.5) containing run configurations, and the input lattice and grid
structures that will serve as starting references for the MC sampling. Then, the QM process is created
(step 2) via an execlp() call, and a UNIX pipe is opened between this process and the Host-Program to
enable inter-process communication. Next, the Host-Program will query the underlying hardware
for the available OpenCL platforms (step 3) and attempt to open an OpenCL context for each of them.
This discovery process respects several user-provided heuristics, such as allowing only certain device
types (e.g., GPUs, CPUs) or setting a maximum of selected devices. Next, the OpenCL buffers are
allocated on the selected device (step 4) and the starting references for the grid and lattice transferred
to the device (step 5) via an OpenCL command queue. Then, the first PMC Cycle (comprising K_cycle steps) is executed on the OpenCL device (step 6), and the resulting lattice configuration and ∆E term
read back to the Host-Program. The latter then communicates these data to the QM Process via a UNIX
pipe (step 7), which then executes the QM Update (step 8). After this, the obtained grid configuration
is sent back to the Host-Program, which finally transfers it to the device, starting the next PMC Cycle.
This concludes one PMC iteration. The described work-flow is repeated for K_PMC iterations, and then the
saved configurations are read back and printed to an output file. Since the OpenCL device may have limited memory, the saved configurations are in fact read back to the host periodically, according to the device's maximum memory. Having the described work-flow in mind, the next subsection describes how the execution of multiple Markov chains is balanced among the available devices.
4.3.1.A Load Balancing Among Multiple Markov Chains
Since the results produced by each Markov chain are equivalent, they may be sampled for a different number of steps with respect to each other. Therefore, balancing the execution of the Markov chains across the different OpenCL devices is accomplished via a simple algorithm that works as follows:
1. Access a shared task-queue. If there are no tasks left, finish execution, skipping step 2.
2. Execute the task taken from the task-queue.
In this approach, the balancing decision is distributed across the OCLDevice threads, although a centralized task-queue is employed to keep a record of the available work-load. This algorithm does not fit perfectly in the classification scheme presented in Section 2.4, although it could be considered a task-queue distributed balancing approach.
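Under this scheme, the shared task-queue can be as simple as an atomic counter polled by the OCLDevice threads. The sketch below (hypothetical names; C++ threads stand in for the device threads, and executing a chain is stubbed out) illustrates the two-step loop described above:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Sketch (hypothetical names) of the distributed balancing scheme: each
// OCLDevice thread repeatedly takes a chain task from a shared task-queue
// until none are left. An atomic counter is the simplest centralized
// record of the remaining work-load.
std::vector<int> run_chains(int num_chains, int num_devices) {
    std::atomic<int> next_task{0};
    std::vector<int> executed_by(num_chains, -1);
    auto device_worker = [&](int device_id) {
        for (;;) {
            int task = next_task.fetch_add(1);  // 1. access the shared task-queue
            if (task >= num_chains) return;     //    no tasks left: finish
            executed_by[task] = device_id;      // 2. execute the task (stub)
        }
    };
    std::vector<std::thread> devices;
    for (int d = 0; d < num_devices; ++d) devices.emplace_back(device_worker, d);
    for (auto& t : devices) t.join();
    return executed_by;
}
```

Because faster devices return to the queue sooner, they naturally pick up more chains, which is the intent of the distributed balancing decision.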
4.4 Data Structure Optimizations
Before describing in detail the fine-grained parallelism strategy (which will be presented in Chap-
ter 5), it is worth discussing the preliminary optimizations made on the original serial code. These
optimizations were employed to ensure that the obtained acceleration results were not inflated due to
under-performance of the serial baseline (more on the baseline definition in Chapter 6). In this respect,
Figure 4.5: Program flow of the devised parallel PMC program, for the case of a single-device single-process instance (in order to keep the illustration clear). The legend for the numbered parts of this figure is presented throughout the text.
some algorithm modifications that were made in the parallel version were later ported to the serial baseline, whenever such optimizations also led to decreased serial execution time.
4.4.1 Indexing Molecules and Atoms
As introduced in Section 3, the pure MM electrostatic interactions, which are computed every PMC
Cycle step, consider the interaction between chmol and every other MM molecule stored in the lattice
(see Algorithm 2). In the original PMC implementation, the data structures employed to store the ele-
ments of the lattice were: i) 6 vectors with AMM entries, storing the parameters {x, y, z, σ, ε, q} for each
MM atom; ii) a vector with A_MM entries that returned the molecule id, given the atom index (henceforth
referred to as atom2mol). This approach caused many inefficient looping cycles, since one would have
to loop through every pair of atoms {i, j}, access atom2mol[i] and atom2mol[j], and then check if any
of these atoms belonged to chmol. As depicted in Algorithm 3, this would waste a lot of cycles just to
find the atoms that belong to chmol.
Algorithm 3 Original interaction loop: (A_MM^2 − A_MM)/2 cycles
1: for each atom i ∈ [0, A_MM − 1[ do
2:   for each atom j ∈ [i + 1, A_MM[ do
3:     if atom2mol[i] != chmol and atom2mol[j] != chmol then
4:       continue;
5:     end if
6:     compute interaction ...
7:   end for
8: end for
To address the described inefficiency, an additional data structure was introduced, to allow map-
ping a specific molecule to the list of its respective atoms (henceforth referred to as mol2list). The
usage of this new structure reduced the total number of cycles for the MM interaction computations from (A_MM^2 − A_MM)/2 to Z_MM^(chmol) × A_MM, which is a much smaller number (see Table 3.1). The resulting iteration
structure is presented in Algorithm 4. Since it enabled a faster execution of electrostatic computations,
this improvement was added to the performance baseline used in this work.
Algorithm 4 Improved interaction loop: Z_MM^(chmol) × A_MM cycles
Init: chmol_atoms = mol2list[chmol]
1: for each atom i in chmol_atoms : i ∈ [0, Z_MM^(chmol)[ do
2:   for each atom j ∈ [0, A_MM[ do
3:     compute interaction ...
4:   end for
5: end for
The structure that was later used in the parallel version was slightly adapted, as depicted in Fig-
ure 4.6. Instead of returning a list with the member atoms, this new structure (henceforth referred to as
mol2atom) returns the index of the first atom belonging to the target molecule, which can then be used to
index the lattice vectors, which contain the {x, y, z, σ, ε, q} data. This structure is more suitable for GPU
platforms, since it keeps the fast access to the atoms of a target molecule that the mol2list structure
Figure 4.6: mol2atom data structure, together with the lattice vectors. The mol2atom structure returns the index of the first atom belonging to the target molecule, which can then be used to index the lattice vectors, which contain the {x, y, z, σ, ε, q} data.
provided, while also offering the possibility of reading the MM atoms directly from the lattice vectors in a
coalesced fashion.
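Building such a mol2atom structure amounts to an exclusive prefix sum over the per-molecule atom counts, as in the hypothetical C++ sketch below (the example values match Figure 4.6: a QM molecule with 36 nuclei followed by MM molecules with 3, 4 and 3 atoms):

```cpp
#include <cassert>
#include <vector>

// Sketch of building mol2atom: an exclusive prefix sum over the per-molecule
// atom counts. Molecule m then owns the contiguous atom index range
// [mol2atom[m], mol2atom[m] + atoms_per_mol[m][ in the flat lattice vectors.
std::vector<int> build_mol2atom(const std::vector<int>& atoms_per_mol) {
    std::vector<int> mol2atom(atoms_per_mol.size());
    int first = 0;
    for (std::size_t m = 0; m < atoms_per_mol.size(); ++m) {
        mol2atom[m] = first;         // index of the first atom of molecule m
        first += atoms_per_mol[m];
    }
    return mol2atom;
}
```

Because the atoms of each molecule occupy one contiguous range, a work-group reading them from the lattice vectors naturally touches consecutive addresses.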
4.4.2 Computing Distances
As introduced in Section 3, the Coulomb MM and VDW MM procedures include the computation of
the Cartesian distances between the chmol and every other MM molecule, in both the new (after the MC
step) and the old system configurations (see Algorithm 2). To save computing operations, the original
PMC implementation maintained two distance buffers, one to store the distances between all the atoms
in the reference system (old-dists), and another to store the distances in the configuration currently being
tested (new-dists). Both these buffers were implemented as a symmetric matrix with A_MM^2 entries, such
that the entry {i, j} stores the same value as the entry {j, i}. By using this mechanism (depicted in
Figure 4.7, left), only the new-dists buffer had to be updated after a new MC step, since the distances
in the reference system were already stored in the old-dists buffer. Hence, only half of the distance
operations have to be executed, resulting in a total of Z_MM^(chmol) × A_MM calculations. However, to maintain
these buffers, additional memory operations had to be performed at the decision step: a) if the MC step
is accepted, the old-dists buffer has to be updated with the new distances computed for the chmol, b)
on the other hand, if the step is rejected, the new-dists buffer needs to be restored to its original state,
since it will have to be used again in the next step. Either of these options will result in 2 × Z_MM^(chmol) × A_MM memory operations, since one has to restore every new/old-dists[m][n] entry for m = chmol, n ∈ [0, A_MM[ and for m ∈ [0, A_MM[, n = chmol. Considering the specific case of a GPU platform and a
new/old-dists buffer implemented as a 1D vector with A_MM^2 entries, the first memory operations would
result in AMM coalesced memory writes, whereas the second memory operations would result in AMM
non-coalesced memory writes. The latter might introduce significant overhead on a GPU platform, which, when also considering the quadratic memory requirement of these buffers (2 × A_MM^2), indicates that the described approach to distance computation is not suitable for GPU platforms.
In order to address this problem, the alternative approach presented in Figure 4.7 (right) was devised.
Figure 4.7: Original approach to distance computation (left), together with the devised on-the-fly solution (right). For the sake of clarity, the distance computation procedures were singled out, although they are executed in the same computation loop as the Coulomb/VDW procedures; the remaining procedures of the PMC Cycle step have also been omitted.
This solution exploits the huge number of compute units available in typical many-core GPU platforms to
compute all the necessary distance operations in every iteration (on-the-fly), totaling 2 × Z_MM^(chmol) × A_MM
calculations. By computing these additional terms, the distance buffers ceased to be required, avoiding
both the quadratic memory requirement and the overhead of updating the distance buffers. Furthermore,
since those buffers are required to be persistent between MC iterations, using them in GPU platforms
would require reading and writing them from global memory2, whereas for the case of the on-the-fly
version, the distance values are generated and consumed in the local scope of the Coulomb/VDW MM
procedures, which results in trading many main memory operations for register operations. Moreover,
the on-the-fly version also proved to be more efficient on the CPU platform used as the baseline for this work (more details in Chapter 6), and was thus also included there.
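As an illustration, the per-pair work of the on-the-fly scheme reduces to a register-only computation such as the sketch below, assuming a cubic periodic box and the minimum-image convention (the text mentions that distances respect a periodic box; the helper name is hypothetical):

```cpp
#include <cassert>
#include <cmath>

// Register-only sketch of one on-the-fly distance evaluation: the squared
// minimum-image distance in a cubic periodic box of side `box`, computed
// (and immediately consumed) inside the Coulomb/VDW MM loop, so that no
// new-dists/old-dists buffers need to be kept or restored.
double min_image_dist2(double dx, double dy, double dz, double box) {
    dx -= box * std::round(dx / box);   // wrap each component into [-box/2, box/2]
    dy -= box * std::round(dy / box);
    dz -= box * std::round(dz / box);
    return dx * dx + dy * dy + dz * dz; // squared distance: no sqrt required
}
```

Both the old and the new chmol positions can be fed through this computation within the same loop iteration, which is where the factor of two in the distance-operation count comes from.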
4.5 Summary
In this Chapter, a top-level description of the devised parallel solution was introduced. Firstly, the
original PMC QM/MM approach (developed at the Free Floater Research Group) was briefly described,
and the main pitfalls present in the original approach were commented on. Then, an introduction on applying Markov chain theory to the particular case of the PMC QM/MM simulation method was presented.
After this, the overall structure of the parallelization strategy was laid out, discussing details about the
structure of the developed OpenCL Host-Program and the work-flow of the complete application. Then,
a coarse-level load balancing solution to schedule the Markov-Chain workload among heterogeneous
2Considering the GPU platforms used in this dissertation.
devices was described. Finally, a few preliminary data-structure optimizations were discussed. A de-
tailed description of the developed OpenCL Kernels, as well as a second load balancing algorithm for
scheduling finer-grained workloads, will be presented in Chapter 5.
CHAPTER 5
FINE-GRAINED PARALLELISM ANDMULTI-DEVICE LOAD BALANCING
Contents
5.1 PMC Cycle Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Exploiting Single Markov Chain Parallelism . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
In this Chapter, a description of the devised parallel solution for extracting fine-grained parallelism
in the PMC Cycle is presented. In this respect, the OpenCL kernels developed for accelerating the PMC Cycle procedures will be introduced and described. Next, a multi-device approach for executing the workload belonging to a single Markov chain is introduced, and the synchronization and
communication overheads are discussed. After this, a fine-grained dynamic load balancing solution is
presented.
5.1 PMC Cycle Parallelization
The OpenCL kernels that compose the PMC Cycle are listed and mapped to the corresponding procedures in Figure 5.1. To minimize communication, the PMC Cycle procedures that share the same
input data (see Figure 3.4) were merged into the same kernel. Furthermore, the OpenCL version re-
quires additional kernels to finish the implemented parallel reductions. In order to keep track of the
kernel dependencies with respect to each other, OpenCL events were used to chain the kernel calls.
These kernels and the employed strategy for their parallelization will be discussed in more detail in the
next subsections.
Figure 5.1: Mapping of the PMC Cycle procedures into OpenCL kernels. It should be noticed that some procedures were merged into the same kernel. Furthermore, the OpenCL version requires additional kernels for the parallel reductions (mm_finish and q3m_finish, marked with a ∗).
Firstly, before entering into the development details of each OpenCL kernel, the memory layout strategy will be discussed. Figure 5.2 presents the memory layout for the main data structures used in the PMC Cycle. The Host-Program will try to fit as much constant data in constant memory as possible, although
this memory is usually much more limited than global memory, and for most devices this will mean having
to place constant buffers in global memory. The layout depicted in Figure 5.2 is a possible instance of
such buffer distribution. As introduced in Section 3, the lattice is composed of 3 constant vectors q, σ, ε
and 3 non-constant vectors x, y, z, which are altered when the reference is updated in the decision
step. The first entries of these vectors hold the variables for the QM atomic nuclei (label 1, Figure 5.2),
whereas the remaining entries hold the MM atom data. Furthermore, the mol2atom structure (see Figure 4.6) may
also be stored in constant memory (label 2). The grid buffers are also constant vectors by nature, since
Figure 5.2: Memory layout example for the main data structures used in the PMC Cycle.
they are not altered during the kernels execution (only by the QM process). However, the size of these
buffers is typically prohibitive (up to 320MB, according to Table 3.1) in respect to the available constant
memory of typical OpenCL devices, forcing the Host-Program to allocate these buffers in global memory
(label 3). All these data buffers were chosen to be represented as one dimensional vectors to allow a
contiguous placement in main memory, and to reduce the level of access indirection (i.e., use a single
pointer) as much as possible. As discussed further on, this favors coalesced memory accesses.
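The layout can be summarized by a structure-of-arrays sketch (a hypothetical C++ mirror of the device buffers; the actual buffer types and sizes are device-dependent):

```cpp
#include <cassert>
#include <vector>

// Structure-of-arrays sketch of the lattice buffers: one contiguous 1D
// vector per per-atom variable {x, y, z, q, sigma, epsilon}, so consecutive
// work-items reading consecutive atoms touch consecutive addresses
// (a coalesced pattern) through a single level of indirection.
struct Lattice {
    std::vector<double> x, y, z, q, sigma, epsilon;
    explicit Lattice(std::size_t n_atoms)
        : x(n_atoms), y(n_atoms), z(n_atoms),
          q(n_atoms), sigma(n_atoms), epsilon(n_atoms) {}
    std::size_t size() const { return x.size(); }
};
```

The alternative array-of-structures layout would interleave the six variables of each atom, breaking the contiguity that coalesced accesses rely on.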
5.1.1 Monte Carlo
The Monte Carlo procedure has little potential parallelism to be extracted, since it is mainly composed of a fairly light and intrinsically serial operation: the random translation and rotation of a single random
molecule (chmol). Nevertheless, an OpenCL kernel (monte_carlo) was developed to execute this task
in the OpenCL device, because this kernel manipulates data that will be used by the kernels that follow,
enabling the communication of these data via the device’s global memory, without needing to have the
Host-Program as an intermediary. Furthermore, since a random number generator would introduce
unnecessary overhead in the GPU1, the random numbers required for the MC perturbation (10 vectors)
are pre-generated in the Host-Program and sent to the OpenCL device. The size of these vectors will
depend on the device's memory capabilities, and it is the Host-Program's responsibility to manage the periodic refresh of these random lists, every N_auto steps.
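A host-side sketch of this pre-generation step is shown below (hypothetical C++; the generator and distribution are illustrative choices, not necessarily those of the actual implementation, which draws separate batches for the molecule id, rotation and translation parameters):

```cpp
#include <cassert>
#include <random>
#include <vector>

// Host-side sketch: draw N_auto values for each of the random-parameter
// vectors up front, to be transferred to the device in one batch and
// refreshed every N_auto steps, avoiding per-step generator state traffic
// in device global memory.
std::vector<std::vector<double>>
pregenerate_random(std::mt19937& rng, int n_vectors, int n_auto) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<std::vector<double>> vecs(n_vectors);
    for (auto& v : vecs) {
        v.resize(n_auto);
        for (auto& r : v) r = uni(rng);  // one batch per random parameter
    }
    return vecs;  // the Host-Program ships these buffers to the device
}
```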
As depicted in Figure 5.3, the monte_carlo kernel starts by loading the necessary random parameters
from memory, then applies the perturbation to the randomly selected molecule (specified in vector rIDs
and loaded from the lattice structure), and finally writes the displaced molecule (chmol) to global mem-
ory. This structure carries the new {x, y, z} values for each atom, the molecule id (ID), and the number
of atoms of the chmol (Z(chmol)MM ). The chmol will be loaded from memory by the energy computation
1A simple pseudo-random generator, such as the one provided by glibc, would require at least two global memory accesses for each random number vector: one to load the current generator sequence state and another to update this same state.
Figure 5.3: Diagram for the devised monte_carlo kernel, together with the layout of the data which is manipulated in this procedure.
kernels, which are described in the next sub-sections.
5.1.2 Coulomb Grid QM/MM
The amount of parallelism that can be extracted within each kernel varies according to existing data
dependencies and the amount of input data. Accordingly, it is highest in the q3m_c kernel, not only
because the Coulomb QM/MM energy interaction (Algorithm 1) is highly data-parallel, but also due to
the size of the grid it takes as input, which may vary from hundreds of thousands to millions of grid
points. The data partition scheme employed in the q3m_c kernel consists of tasking each work-item
with computing the interaction between P grid points and the atoms belonging to chmol. While the
latter is the same for every work-item and might be loaded as a global memory broadcast, the former
consists of different load addresses for each work-item. In order to obtain coalesced memory accesses,
the grid partition shown in Figure 5.4 was employed. As depicted, the grid data is stored in four one-
dimensional independent vectors, one vector for each coordinate {x, y, z} and another vector for the
charge q. Each work-group performs P memory loads, where each work-item gets the vector addresses
localindex + wgsize × i (for i iterating from 0 to P − 1). Hence, by using this strategy, work-group grid
point loads always fetch contiguous addresses, thus achieving a coalesced memory access. It should
be noticed that although Figure 5.4 depicts an example for wgsize = 4 and P = 2, this is merely for
illustration purposes, as the optimal parameter choice is different according to the target OpenCL Device
Figure 5.4: Scheme used for partitioning the grid among the work-groups, in order to allow a coalesced memory access pattern. For the sake of keeping the illustration clear, an example for P = 2 and wgsize = 4 is shown.
(see Chapter 6 for further details).
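The described addressing can be captured in one expression. The hypothetical helper below reproduces the P = 2, wgsize = 4 example of Figure 5.4, adding the work-group's stripe offset to the localindex + wgsize × i rule stated above:

```cpp
#include <cassert>

// Hypothetical helper reproducing the described addressing: work-item
// `local_index` of work-group `wg` performs its i-th load (i = 0..P-1) at
// this grid-vector address, so each group-wide load fetches one contiguous
// stripe of wgsize addresses (a coalesced access).
int grid_address(int wg, int wgsize, int P, int local_index, int i) {
    return wg * P * wgsize + local_index + wgsize * i;
}
```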
A diagram of the q3m_c and q3m_finish kernels is presented in Figure 5.5. Firstly, each work-item loads from global memory P grid points and the atoms that comprise the chmol molecule. The latter corresponds to a global memory broadcast, whereas the former is performed by P coalesced global
memory read instructions, each reading a contiguous stripe of grid points to the work-group (step 1, in
Figure 5.5). Then, for each {atom, grid point} pair, the corresponding work-item computes the squared
Cartesian distance (according to a periodic box) and compares it with squared cutoffs, thus avoiding an
expensive sqrt operation. Depending on the resulting distance, the corresponding energy expression is
computed (see the cutoff branches in Algorithm 1) and the results are accumulated in private memory
(step 2). After this, work-items of the same work-group reduce the computed energies using local mem-
ory, by accumulating all terms in one memory address after log2(work-group size) iterations (label 3).
Then, the first work-item of each work-group writes the obtained partial result into global memory and a
final reduction kernel with only one work-group is launched (label 4), to reduce the remaining terms into a single value (since different work-groups cannot synchronize with each other during a kernel execution). Hence, by including a
first set of energy reductions in the same kernel as the ∆E^{C,grid}_{QM/MM} energy computation (q3m_c), expensive global memory transfers that would otherwise be required between kernel launches are avoided.
Furthermore, all reductions are organized in order to favour warp/wavefront release, ensuring that half
of the active work-items finish their execution soon after each reduction iteration, thus promoting higher
GPU occupation. The corresponding reduction structure is presented in Algorithm 5.
Figure 5.5: q3m_c and q3m_finish kernels structure. In this example, work-group 0 was presented with additional detail, although all work-groups share an identical structure. Likewise, the 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable. Furthermore, additional details concerning the first global memory accesses (label 1) are depicted in Figure 5.4.
5.1.3 Coulomb/VDW MM
The mm_vdwc kernel has a similar structure to q3m_c, except that it accounts for the interaction between the changed molecule (chmol) and the lattice, instead of the grid. In this kernel, the Coulomb and
the vdW interactions have been merged together, allowing the sharing of the result from the distance
computation of the same {atom, atom} pair via private memory (registers) in the same work-item.
The reduction structure is the same as the one presented in Figure 5.5. The data structure optimiza-
tions discussed in Section 4.4 were herein employed, to avoid having to maintain a buffer to store the
distances. The parallelization structure is identical to the one presented for Coulomb Grid QM/MM (see
Section 5.1.2), apart from the involved data.
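The benefit of the merge can be illustrated with a per-pair sketch in which a single squared distance feeds both energy terms (standard Coulomb and 12-6 Lennard-Jones forms are assumed here; prefactors, units and cutoff handling are omitted, and the helper name is hypothetical):

```cpp
#include <cassert>
#include <cmath>

// Per-pair sketch of the merged kernel: one squared distance, held in a
// register, feeds both the Coulomb and the 12-6 Lennard-Jones (vdW) terms,
// so the distance is computed once instead of twice.
void pair_energies(double r2, double qi, double qj,
                   double sigma, double epsilon,
                   double& e_coulomb, double& e_vdw) {
    e_coulomb = qi * qj / std::sqrt(r2);          // Coulomb term needs r
    double s6 = std::pow(sigma * sigma / r2, 3);  // (sigma/r)^6 directly from r2
    e_vdw = 4.0 * epsilon * (s6 * s6 - s6);       // Lennard-Jones 12-6 term
}
```

Note that the vdW term never needs the square root: it is built from r² alone, which is one reason squared distances are carried through the kernels.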
5.1.4 Coulomb Nuclei/VDW QM/MM
Unlike the other energy computation kernels, the q3m_vdwc kernel involves a much lower amount of input
data. By recalling from Section 3.2 that the complexity of the QM/MM Coulomb Nuclei and QM/MM
VDW procedures is Z_MM^(chmol) × Z_QM, and by consulting Table 3.1, one can conclude that the amount
of loop iterations of these procedures falls somewhere in the order of magnitude of 10^2. Hence, if an approach similar to the other energy computation kernels is followed, this means that the maximum amount of work-items one can spawn will also fall in the order of magnitude of 10^2. For this reason, the
reduction structure for q3m_vdwc is simpler, as shown in Figure 5.6. In this kernel, only two work-groups
Algorithm 5 Pseudo-code for the energy reduction.
Init: local_size = size of this work-group
Init: local[local_id] = private ∆E^{C,grid}_{QM/MM} energy
1: for offset = local_size/2; offset > 0; offset >>= 1 do
2:   if local_id < offset then
3:     local[local_id] = local[local_id + offset] + local[local_id];
4:   end if
5:   Local barrier: wait for the work-group.
6: end for
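A sequential C++ simulation of this tree reduction is given below; on the device, each inner loop is executed concurrently by the active work-items and the iterations are separated by local barriers:

```cpp
#include <cassert>
#include <vector>

// Sequential simulation of the local-memory tree reduction of Algorithm 5:
// after log2(size) halving iterations, element 0 holds the sum of all the
// work-group's private energies (the size is assumed a power of two).
double tree_reduce(std::vector<double> local) {
    for (std::size_t offset = local.size() / 2; offset > 0; offset >>= 1)
        for (std::size_t id = 0; id < offset; ++id)  // the active work-items
            local[id] += local[id + offset];
    return local[0];
}
```

The halving of `offset` is also what enables the early warp/wavefront release mentioned above: after each iteration, the upper half of the previously active work-items has no further work.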
Figure 5.6: q3m_vdwc kernel structure. An 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable.
are launched, one for the Coulomb Nuclei QM/MM procedure and another one for QM/MM VDW. The
former is tasked with computing and summing the EC Coulomb terms (label 1, Figure 5.6) for each QM
atom (zi) and chmol atom (aj), whereas the later is responsible for computing the Evdw terms (label
2), for the same atom pairs. Finally, when each work-group finishes execution, the accumulated energy
terms are written in global memory (label 3), to be subsequently read by the decide_update kernel.
In contrast with the q3m_c/q3m_finish kernels, a subsequent reduction kernel is not required, since each energy term is reduced completely within one work-group. Depending on the target OpenCL device, the
chosen work-group size might vary, and the data each work-item computes varies accordingly.
5.1.5 Decision Step
After all the energy computation kernels have terminated their execution, the decide_update kernel
is launched. Figure 5.7 depicts the work-flow of this kernel. First, the accumulated results from the
previous kernels are read from global memory and added together (label 1, Figure 5.7). Then, the
step is accepted if the energy of the obtained configuration is lower than that of the previous configuration reference, or accepted with probability e^{−∆E/(k_B T)} if the energy of the system has risen. Otherwise, the step
is rejected. For the case of accepted steps, the chmol configuration under test is copied to the current
lattice reference (label 2). Regardless of this decision, the current system configuration is saved (label
Figure 5.7: decide_update kernel diagram. An 8 work-items per work-group configuration was adopted for simpler illustrative purposes, as the work-group size is fully parameterizable.
4) to global memory every F_output steps (see Table 3.1). This step-saving operation takes 3 × (A_MM + Z_QM)/wgsize cycles, and follows a coalesced memory write pattern. The kernel finishes execution either after this
memory operation (label 4), or immediately after the step has been decided (label 3). Furthermore, since
the saved configurations will occupy a fair amount of memory in the OpenCL device (each configuration
taking 3 × (A_MM + Z_QM) numbers), the host is responsible for periodically reading these buffers back to main memory and writing them to an output file.
Since the typical range for the parameter Foutput is fairly high (see Table 3.1), the employed parallelization
scheme in the step saving does not have much impact. Nevertheless, it was implemented for
the sole purpose of enabling faster debug runs, where one might want to print every step (Foutput = 1) to
observe how the QM/MM system evolves at a higher granularity. This is an important feature for code
maintainability.
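The acceptance rule of the decision step can be sketched as follows. This is a minimal host-side illustration, not the OpenCL kernel itself; the Boltzmann constant value (here in Hartree/K) and the function name are illustrative assumptions.

```python
import math
import random

KB = 3.1668114e-6  # Boltzmann constant in Hartree/K (illustrative units)

def decide_step(delta_e, temperature, rng=random.random):
    """Metropolis criterion: accept if the energy decreased, otherwise
    accept with probability exp(-dE / (KB*T))."""
    if delta_e <= 0.0:
        return True
    return rng() < math.exp(-delta_e / (KB * temperature))
```

Downhill moves are always accepted, while uphill moves survive with a probability that decays exponentially with the energy increase.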
5.2 Exploiting Single Markov Chain Parallelism
As introduced in Chapter 4, each Markov chain represents one Monte Carlo state-space exploration
instance. The particular case of having a single Markov chain corresponds to the work-flow which was
depicted in Figure 4.5. In this approach, the QM Update and the PMC Cycle depend strictly on the
previous PMC iteration, so only one of these procedures can be executed at a given time (as depicted
in Figure 4.5). Nevertheless, a higher amount of parallelism can still be extracted by running the PMC
Cycle instance belonging to the same Markov chain on multiple OpenCL devices. The workload
Figure 5.8: Exploiting multiple heterogeneous OpenCL devices to execute the PMC Cycle. The execution is balanced by executing different kernels on each device and dividing the work of the heavier kernels (q3m c and q3m reduce).
distribution of a single Markov chain among multiple OpenCL devices is discussed in the next sections.
5.2.1 Multiple OpenCL Devices
As discussed in Section 3.2, the most computationally intensive part of each PMC Cycle step
corresponds to the computation of the ∆E^(C,grid)_(QM/MM) energy term. In the presented OpenCL approach, this
energy calculation is handled by the q3m c and q3m finish kernels (see Section 5.1.2). Furthermore,
according to the dependency chart depicted in Figure 3.4, the procedure which these kernels execute
(Coulomb Grid QM/MM) only depends on the chmol data structure from the MC step and on the grid
data (which is written once to the OpenCL device at the start of each PMC Cycle execution). Moreover,
typical grids have hundreds of thousands to millions of points, allowing for a fine-grained partition among
devices. Accordingly, all those conditions make these two kernels excellent candidates for multi-device
acceleration.
Figure 5.8 illustrates the employed multi-device parallelization approach for a generic heterogeneous
system composed of a host CPU and N different OpenCL devices. In this approach, the Host is
responsible for synchronizing operations between OpenCL devices, which share partial energy results on every
iteration. In this particular example, device 0 is running all kernels, although q3m c and q3m reduce
only compute part of ∆E^(C,grid)_(QM/MM). Devices 1 to N, which might be accelerators with different compute
capabilities, calculate the remaining terms of ∆E^(C,grid)_(QM/MM). The relative performance of the accelerators (with
respect to each other) will determine the fraction of the grid each one gets (G0% to GN%) and to which
devices the least complex energy computation kernels are scheduled.
In order to keep synchronization overhead to a minimum, every device computes the MC and decision
kernels redundantly, although only one of the devices is responsible for saving the sampled
configuration, since this is the heaviest part of this procedure (see Section 5.1.5). The overhead
associated with the device synchronization, to be executed at every step, is caused by several factors.
Firstly, to read and write the partial energies of each device, one has to call the OpenCL functions
enqueueReadBuffer and enqueueWriteBuffer, which also include an implicit clFinish to wait for the pre-
vious kernels in that step to finish (launches are chained using OpenCL Events). This is accounted for
in the R/W Launches block, in Figure 5.8. Secondly, each memory transfer introduces a small overhead
corresponding to a copy of one floating-point number per reduced energy term. The number of com-
municated terms ranges from 1 to 2 terms per device, according to the employed partitioning, since it
depends on which device is computing the lighter kernels. Finally, syncing the Host-side threads that
are managing the OpenCL accelerators (Barrier Sync) and launching and parametrizing the OpenCL
Kernels (Launch overhead) also introduces some overhead.
The multi-device synchronization overheads discussed earlier do not scale with the problem size,
depending only on the number of devices that the Host-Program has to manage. Although the Host-
Program will allocate a dedicated thread for each device (see Section 4.3.1), they will compete for the
Host resources, and the effective Host-thread parallelism may degrade. Therefore, these overheads
have a complexity of O(Ndevices), although for a small number of devices with respect to the maximum
number of parallel threads that the Host CPU can run, these complexities will in practice be sub-linear
in Ndevices. Table 5.1 presents the complexity of the discussed overheads, together with two
other overheads: random list refreshing (see Section 5.1.1) and output flushing (see Section 5.1.5). The
former depends on the random list refresh frequency (10 arrays with Nauto entries, every Nauto steps),
whereas the latter depends on the number of saved QM/MM systems that the OpenCL device can hold
in its global memory (Nsystems), since the host has to read back these systems before the available
memory runs out (every Nsystems steps). Furthermore, each saved system configuration has 3 arrays of
size AMM + ZQM (see Section 5.1.5), which results in the final expression presented in Table 5.1. As for
the dependence on the number of OpenCL devices, the same rationale developed earlier applies.
Table 5.1: Complexity of communication and synchronization overheads, with respect to the QM/MM system characteristics and to run parameters.
Overhead | Overhead Complexity per PMC Cycle Step
Launch Overhead | O(Ndevices)
R/W Launches | O(Ndevices)
Read partial ∆E | O(Ndevices)
Write partial ∆E | O(Ndevices)
Refresh Random Lists | O(Ndevices × Nauto/Nauto)
Flush Output | O(Ndevices × (AMM + ZQM) × Nsystems/Nsystems)
5.2.2 Dynamic Load Balancing
To account for the possible heterogeneity of the computational platform, the amount of grid data that
is assigned to each device on each iteration is chosen according to a dynamic load balancing algorithm.
In this respect, considering the classification scheme for load balancing algorithms presented in
Section 2.4, the algorithm herein described is a centralized predicting-the-future dynamic load balancing
approach. Accordingly, Figure 5.9 depicts the work-flow of this solution, which starts from an unbalanced
load distribution and eventually converges to a balanced work-load distribution after J iterations.
Furthermore, the balancer continues to monitor the performance of the computing nodes, to ensure that
the work-load distribution remains optimal. This solution was based on one of the algorithms
presented in [12], for the case of constant data balancing problems.
Figure 5.9: Work-flow of the centralized predicting-the-future dynamic load balancing solution employed in this dissertation.
In order to apply this approach for the particular case of the q3m c/finish kernels, the 3D grid is
divided into n small and independent grid blocks. In the first step, all p devices are assigned the same
number of blocks d^0_i = n/p. Then, every r steps, this distribution is updated. Thus, at
step k, device 1 computes the grid blocks b_1, ..., b_{d^k_1}, device 2 computes blocks b_{d^k_1 + 1}, ..., b_{d^k_1 + d^k_2}, and
so on. All devices have access to all grid points, so that data displacement is not required.
Let t_i(d^k_i) be the time taken by device i to compute the assigned d^k_i blocks (plus the remaining
kernels it has been assigned) in iteration k. The implemented load balancer works as follows:
1. If max over all device pairs (i, j) of |(t_i(d^k_i) − t_j(d^k_j)) / t_i(d^k_i)| < ε, the load is balanced. Skip (2).
2. Recompute the amount of assigned grid blocks: d^{k+1}_i = n × (d^k_i / t_i(d^k_i)) / Σ^p_j (d^k_j / t_j(d^k_j))
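The two balancing steps above can be sketched on the host side as follows; the function name and the rounding fix-up at the end are illustrative assumptions, not the dissertation's actual implementation.

```python
def rebalance(d, t, n, eps=0.05):
    """One load-balancing decision, following steps 1-2 above.

    d[i]: grid blocks currently assigned to device i
    t[i]: measured time device i took for its d[i] blocks
    n:    total number of grid blocks
    Returns the (possibly unchanged) new distribution.
    """
    # Step 1: if the worst pairwise relative imbalance is below the
    # tolerance, keep the current distribution.
    imbalance = max(abs(ti - tj) / ti for ti in t for tj in t)
    if imbalance < eps:
        return list(d)
    # Step 2: redistribute the n blocks proportionally to measured speed
    # (blocks per unit time).
    speeds = [d[i] / t[i] for i in range(len(d))]
    total_speed = sum(speeds)
    new_d = [round(n * s / total_speed) for s in speeds]
    # Fix rounding so the assignment still covers exactly n blocks.
    new_d[-1] += n - sum(new_d)
    return new_d
```

Starting from an even split, a device that is three times slower quickly converges to roughly a quarter of the blocks.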
At this point, it is important to recall (see Section 3.1) that different cutoff regions may result in differ-
ent energy expressions (or no energy computation at all). Consequently, the accelerators might com-
pute over grid partitions that fall into different cutoff regions, thus having different computational efforts.
Hence, execution time measurements should be averaged over several previous steps, in order to
avoid misclassifying device performance. The described algorithm
is run by the Host once every Fbal steps, after the Barrier Sync (see Figure 5.8). Since the computational
cost of a given step depends on the chmol position in space (which defines the cutoff region
center), Fbal should be of the same order of magnitude as NMM (10^3, see Table 3.1), to ensure that on average
every molecule has been moved once between balancing steps. Furthermore, in order to be able to
make the balancing decision, the performance measurements of each device are shared between the
corresponding Host-threads, via Host shared-memory.
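Averaging the per-step measurements can be done with a simple rolling window on the host side; this sketch (the class name and window default are illustrative) smooths out the step-to-step variation caused by the different cutoff regions.

```python
from collections import deque

class DeviceTimer:
    """Rolling mean of the last `window` per-step execution times,
    used to estimate a device's performance robustly."""
    def __init__(self, window=2000):
        self.samples = deque(maxlen=window)  # old samples fall off the front

    def record(self, step_time):
        self.samples.append(step_time)

    def mean(self):
        return sum(self.samples) / len(self.samples)
```
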
5.2.2.A Problem Partitioning Approaches
In the literature on accelerating MD simulations, additional data structures which allow a fast verification
of interaction cutoffs are sometimes employed. A typical approach consists in using a neighbor-list
method [4, 56]. A neighbor-list is a data structure (e.g., a matrix or a vector of linked lists) that records,
for each particle, the particles that are close enough to it, according to a cutoff. This spares unnecessary
computation over molecules which are already known to be outside the interaction cutoff. Another
possible approach is the cell-list method [49], which employs a similar data structure, although
the space is partitioned into geometrical cells, instead of the more relaxed (in the geometrical sense)
neighbor-list approach. Although these works target MD simulations, a similar approach could be un-
dertaken for the PMC QM/MM (which is an MC method) to partition the QM grid points. However, since
the grid has to be updated every PMC (outer) iteration (by the QM Update), it is not obvious if a cell-list or
neighbor-list approach would bring additional performance gains, since the overhead of creating these
lists would have to be repeated every PMC iteration. The study of a possible grid partitioning approach
is left for future work.
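For reference, a cell-list can be built with a single hashing pass over the particle positions; this is a generic sketch (not tied to the PMC QM/MM grid), where the function names and the uniform cubic cells are assumptions.

```python
from collections import defaultdict

def build_cell_list(positions, cell_size):
    """Hash each particle into a cubic cell of edge cell_size; cutoff
    checks then only need to visit the 27 surrounding cells."""
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        key = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        cells[key].append(idx)
    return cells

def neighbours(cells, pos, cell_size):
    """Candidate particles within one cell of pos (a superset of the
    particles inside the cutoff sphere)."""
    cx, cy, cz = (int(c // cell_size) for c in pos)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                out.extend(cells.get((cx + dx, cy + dy, cz + dz), ()))
    return out
```

The candidate set returned by neighbours() still has to be filtered with the exact cutoff distance, but the vast majority of far-away particles is never visited.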
5.3 Summary
In this Chapter, a description of the devised parallel solution for extracting fine-grained parallelism
in the PMC Cycle was presented. In this respect, the developed OpenCL kernels were introduced and
described, and the devised OpenCL memory layout structure was presented. Since the heaviest kernels
correspond to q3m c and q3m finish, these procedures were described in greater detail. Then, a multi-device
approach for executing the workload belonging to a single Markov
chain was introduced, and the synchronization and communication overheads discussed. Finally, a fine-
grained dynamic load balancing solution to efficiently take advantage of heterogeneous OpenCL devices
was presented.
CHAPTER 6
EXPERIMENTAL EVALUATION
Contents
6.1 Benchmarking Setup . . . . . . . . . . 53
6.2 PMC Cycle Acceleration . . . . . . . . . . 55
6.3 Global PMC Results . . . . . . . . . . 60
6.4 Numerical Evaluation: Convergence Accuracy and Energy Consumption . . . . . . . . . . 63
6.5 Summary . . . . . . . . . . 64
Table 6.1: Considered QM/MM benchmark datasets. The chemical aspects of bench-R are presented in detail in [11].
Benchmark | MM Part (NMM) | QM Part | Grid size (NQM) | Total MC steps
bench-R | 500 H2O molecules | Chorismate | 183,356 | 24.8×10^6
bench-A | 1301 H2O molecules | 1 Arginine | 1,772,972 | 10×10^3
bench-B | 5000 H2O molecules | 1 Arginine | 1,772,972 | 10×10^3
bench-C | 5000 H2O molecules | 2 Arginines | 2,992,458 | 10×10^3
In this Chapter, a detailed performance assessment of the devised parallel heterogeneous approach
to the PMC QM/MM method is presented and discussed. Firstly, the considered performance metrics
and the employed profiling tools are described. Then, the benchmarking setting will be discussed, both
in terms of the considered chemical datasets and the target hardware platforms. Following, the perfor-
mance baselines are defined, establishing a term of comparison for the obtained acceleration results.
After this, the acceleration results obtained both in the simulation bottleneck (PMC Cycle) and in the full
simulation are presented and discussed, and the scalability of the devised solution is evaluated. Finally,
the numerical quality and energy consumption are evaluated for alternative numerical representation
schemes.
6.1 Benchmarking Setup
6.1.1 Chemical Datasets
To experimentally evaluate the proposed parallelization approach, four chemical datasets
were carefully designed by chemical experts from the Institut fur Physikalische Chemie, Georg-August-
Universitat Gottingen. The first three benchmarks, namely bench-A, bench-B and bench-C (see Table 6.1),
represent typical QM/MM setups and will mainly be used to assess the performance of the most
demanding simulation step (i.e., the PMC Cycle), by running 10k steps. The QM part of these benchmarks
consists of a set of protonated arginines that are acylated at the N-terminus and methylaminated at the
C-terminus. This amino acid was solvated in a periodic water box (MM part), containing a variable num-
ber of water molecules (depending on the considered benchmark). The grid for the electronic charge
density description was constructed by following Mura and Knowles (α = 1 and m = 3) for the radial
distribution [36] and Lebedev (lmax = 53) for the angular distribution [27]. The QM calculations used the
density functional PBE [43] and the basis set def2-SVP [54], while the MM part was described with the
OPLS-AA force field [25]. Furthermore, the latest development version of the MOLPRO [55] program
package was used in the QM calculations.
The fourth benchmark, bench-R, consists of a smaller simulation box, designed for a much longer
and realistic run. This simulation corresponds to the chorismate molecule in solution. Its conversion to
prephenate is a widely studied biochemical reaction, and the respective chemical aspects are described
in [11]. For this benchmark, the run is comprised of 24.8 million steps, with the QM update executed
every 50k steps (totalling 496 PMC outer iterations).
Table 6.2: Considered execution platforms in the experimental evaluation.
Platform | Host CPU | RAM (CPU) | OpenCL Accelerators | RAM (Accel.)
mcx0 | Intel Core i7-4770K 4-core 3.5GHz | 32GB | - | -
mcx1 | Intel Core i7-4770K 4-core 3.5GHz | 32GB | Nvidia GTX 780Ti | 3GB
mcx2 | Intel Core i7-4770K 4-core 3.5GHz | 32GB | Nvidia GTX 780Ti/660Ti | 3GB/2GB
mcx3 | Intel Core i7-3820 4-core 3.6GHz | 16GB | AMD R9 290X/Nvidia 560Ti | 3GB/1GB
mcx4 | 2x Xeon E5-2609 (4-core each) 2.4GHz | 32GB | 2x Nvidia GTX 680 | 4GB/4GB
mcx5 | Intel Core i7-3770K 4-core 3.5GHz | 8GB | Nvidia K20C | 5GB
mcx6 | Intel Core i7-4770K 4-core 3.5GHz | 32GB | i7-4770K | 32GB
6.1.2 Hardware Platforms
The considered hardware for the experimental setup is listed in Table 6.2. The considered plat-
forms correspond to several hardware configurations of the machines available at the SiPS research
group, which include Intel i7 CPUs, Nvidia GPUs and AMD GPUs. These platform configurations were
selected to allow a fairly complete evaluation of the devised parallel solution: i) mcx0 will be used as
the performance baseline (more details in Section 6.1.3), ii) mcx1 and mcx2 were selected to evaluate
the load balancing solution between two GPUs with very different compute performances (GTX 780Ti
and GTX 660Ti), iii) platform mcx3 was selected to evaluate the performance of a highly heterogeneous
system composed of GPUs from different vendors (AMD R9 290X and Nvidia 560Ti), iv) mcx5 will
mainly be used to assess energy consumption (since it supports NVML power measurements), v) mcx6
will be used to evaluate the parallel OpenCL solution when running on a multi-core CPU. In the pre-
sented platform configurations the Host-CPU will be both managing the OpenCL devices and running
the QM Updates, using all the available cores.
Different OpenCL work-group partitioning schemes were used for each device. For Nvidia GPUs,
the CUDA calculator [41] proved to be a useful tool for choosing starting point parameters. For AMD
cards and Intel CPUs, the optimal values were found through testing and experimentation, resulting in
small multiples (e.g., 1 to 4) of the preferred elementary work-group size returned by an OpenCL device
discovery query, made at runtime to the underlying platform. The newest available OpenCL standard
was used for each device (OpenCL 1.1 for the considered Nvidia GPUs and OpenCL 1.2 for the Intel
CPUs and AMD GPUs).
6.1.3 Performance Baseline
The original PMC QM/MM single-core code was reviewed and optimized, to ensure that the obtained
acceleration results were not inflated due to under-performance of the serial baseline. Accordingly, the
optimizations discussed in Section 4.4 were added to the original algorithm. Most of the performance
comparisons presented in this Chapter are relative to the optimized version of the reference code ex-
ecuted on a single core of the i7-4770K processor (platform mcx0), compiled with Intel compiler (ICC
v13.1.3) with flags -O3 -xCORE-AVX2, unless otherwise specified. This baseline will henceforth be re-
ferred to as avx2-baseline. Figure 6.1 and Figure 6.2 illustrate a profiling evaluation of the avx2-baseline,
using the bench-A input dataset. In particular, Figure 6.2 presents the overall execution results for one
PMC iteration, whereas Figure 6.1 depicts a more detailed overview of each step of the simulation bot-
Figure 6.1: Time footprint for a single PMC Cycle step for the bench-A dataset running on the avx2-baseline.
Figure 6.2: One complete PMC outer iteration, comprised of 10k PMC Cycle steps and a QM Update, for the bench-A dataset running on the avx2-baseline. The bottleneck of each PMC iteration is the PMC Cycle.
tleneck. As predicted in Section 3.2, the Coulomb Grid QM/MM procedure (∆E^(C,grid)_(QM/MM)) represents (for
all the tested input QM/MM systems) the most time consuming part of each PMC Cycle step, since
O(Coulomb Grid QM/MM) = Z^(chmol)_MM × NQM and NQM tends to be a very large number (1,772,972 for
the case of bench-A).
Furthermore, both double-precision (fp64 ) and mixed double and single-precision (fp64 -fp32 ) data-
types will be employed in the performance study made in this section. Details about these numerical
configurations and the corresponding compromises, as well as a mixed fixed-point precision approach,
will be discussed in Section 6.4.
6.2 PMC Cycle Acceleration
The main performance metric of choice is the execution time of the accelerated application. However,
to further show the benefits of the proposed parallelization approach, the application speed-up with
respect to the baseline serial execution is also adopted:

Speedup = Tbaseline / Tparallel (6.1)
where Tbaseline is the execution time of the baseline, corresponding to the serial execution
on the host CPU, compiled with the Intel compiler (ICC v13.1.3) with flags -O3 -xCORE-AVX2, so as to
enable automatic loop vectorization and the usage of AVX2 vector instructions in compliant processors
(e.g., Intel 4th generation core i7). Furthermore, Tparallel represents the execution time of the proposed
solution using the system under test. In order to measure Host-side execution times, the PAPI [35]
library is used. For evaluating kernel execution time and buffer transfers to the OpenCL devices, OpenCL
Profiling Events are used instead, since they allow a finer measurement of OpenCL device operations. In
order to identify execution bottlenecks and guide the process of algorithm acceleration, the kcachegrind
tool (based on valgrind [37]) was employed.
Table 6.3 presents the PMC Cycle execution time (10k steps) for benchmarks bench-A, bench-B and
bench-C, profiled for several hardware configurations. The overall execution time corresponds to the
cost of running 10k steps plus the final output flushing from the OpenCL device back to the host and file
writing (Output time in Table 6.3, ranging from ∼0.5s to ∼2s). The extra overheads related to the
OpenCL initialization and input file reading were not accounted for, because they do not scale with the
simulation size and would be diluted in longer runs (contrary to the output generation). Since the amount
of generated output scales with the number of executed steps, this overhead repeats itself every
10k steps (for this particular run), and is therefore taken into account in the speed-up calculations.
Table 6.3: Execution time (in seconds) for a PMC Cycle with 10k steps, on several hardware platforms, when using fp64-fp32 mixed-precision. The column "Total" corresponds to the complete execution times of the PMC Cycle (10k steps), including the final serial overhead of reading back and writing the output to a file. This overhead is discriminated in column "Output". The presented execution times correspond to a median among four experimental trials, for each platform configuration.
Platform | Accelerators | bench-A Total | bench-A Output | bench-B Total | bench-B Output | bench-C Total | bench-C Output
mcx0 | none (avx2-baseline) | 769.96 | 0.231 | 787.68 | 0.755 | 1179.70 | 0.760
mcx6 | i7-4770k | 137.90 | 0.899 | 140.25 | 6.751 | 232.20 | 6.873
mcx1 | 780Ti | 6.33 | 0.534 | 8.08 | 1.699 | 11.20 | 1.717
mcx2 | 780Ti/660Ti | 5.04 | 0.517 | 6.73 | 1.657 | 8.68 | 1.656
mcx3 | R9 290X/560Ti | 6.90 | 0.572 | 8.42 | 2.015 | 11.30 | 2.019
The difference between the execution times of bench-A and bench-B stems from the number of
MM molecules, which has two implications: bench-B imposes a heavier footprint on the mm vdwc
and mm finish kernels, and generates a larger output, which in turn means a heavier
decide kernel and a longer output flushing. The latter can be observed in Table 6.3 and mainly depends
on Host-to-Device communication speed to read back the output, and on the time to write the output file.
Consequently, it is higher in the parallel platforms, since the output has to be read back from an external
device (in respect to the OpenCL-Host).
On the other hand, bench-C has a larger QM part, resulting in heavier q3m c and q3m finish kernels.
This favors the overall performance with respect to bench-B, as the performance of the most data-parallel
kernels is favored by a higher number of grid points. The speed-up results of the parallel platforms with
respect to the avx2-baseline (corresponding to the execution times presented in Table 6.3) are depicted
in Figure 6.3.
Figure 6.3: Speed-up obtained for a PMC Cycle with 10k iterations, when using fp64-fp32 mixed-precision. The corresponding execution times are presented in Table 6.3.
According to the presented results, the speed-up values in the PMC Cycle acceleration are fairly
high when compared to the avx2-baseline. This is a direct consequence of a careful exploitation of the
memory hierarchy, together with the higher memory bandwidth of GPU architectures. In fact, although
CPUs compensate for their lower main memory bandwidth with multiple levels of high-speed caches, the most
intensive procedure in the PMC Cycle (Coulomb Grid QM/MM) requires loading a huge amount of data
from main memory at each step (e.g., up to 48MB for the case of bench-C), rendering the first cache
levels useless. Nevertheless, coalesced memory accesses still exploit parallelism when accessing the
main GPU device memory, regardless of whether local caches are used.
Table 6.4 presents kernel execution times for the particular case of the GTX780Ti accelerator, to-
gether with the times corresponding to the reference implementation in the avx2-baseline platform. As
can be observed, the kernels that achieve the highest speed-up are q3m c and q3m reduce, as pre-
dicted in Section 5.1.2. The very large speed-up attained in these kernels (160.84×) is subsequently
affected by Amdahl’s Law (considering the fractions and speed-ups of all the other kernels) and results
in an overall PMC Cycle step speed-up of 135.55×. By recalling the execution times presented in
Table 6.3 for the particular case of the GTX780Ti accelerator, the speed-up without considering the Output
overhead would be (769.96 − 0.231)/(6.33 − 0.534) ≈ 132.8× (versus the value of 121.29× presented in Figure 6.3, where
every component is taken into account), which is slightly below the speed-up attained in the PMC Cycle
step, due to device management and kernel launching overheads, not accounted for in Table 6.4.
Furthermore, two additional details are worth commenting on. Firstly, the monte carlo kernel is faster
on the GPU, since it relies on pre-generated random number lists, which are computed by the Host in
parallel and refreshed when necessary. Conversely, the baseline version computes these numbers
on-the-fly, resulting in a heavier Monte Carlo step. Secondly, the decision kernel is also faster because
the results are accumulated locally and only read back and written to a file from time to time, thus being
accounted for in the Output fraction of the profiling (see Table 6.3).
Table 6.4: Kernel execution times obtained on the GTX780Ti accelerator and on the reference avx2-baseline platform, for the particular case of bench-A. The speed-up with respect to the avx2-baseline is also presented, together with the fraction of the PMC Cycle (%) each kernel represents.

Kernel | avx2-baseline | GTX780Ti | Speed-up
monte carlo | 32us (0.04%) | 17us (3.0%) | 1.88×
q3m c/finish | 76077us (98.8%) | 473us (83.3%) | 160.84×
mm vdwc/finish | 791us (1.03%) | 40us (7.0%) | 19.77×
q3m vdwc | 4us (0.01%) | 18us (3.2%) | 0.22×
decide | 94us (0.12%) | 20us (3.5%) | 4.70×
total | 76998us (100%) | 568us (100%) | 135.55×
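The Amdahl's Law argument can be made concrete by recomputing the overall speed-up implied by the per-kernel figures in Table 6.4. This sketch uses the rounded table fractions, so it lands near, but not exactly on, the reported 135.55×.

```python
# Per-kernel baseline time fractions and measured speed-ups from Table 6.4.
kernels = {
    "monte_carlo":    (0.0004, 1.88),
    "q3m_c_finish":   (0.9880, 160.84),
    "mm_vdwc_finish": (0.0103, 19.77),
    "q3m_vdwc":       (0.0001, 0.22),
    "decide":         (0.0012, 4.70),
}

def amdahl_speedup(parts):
    """Overall speed-up = 1 / sum(fraction_i / speedup_i) over all kernels."""
    return 1.0 / sum(f / s for f, s in parts.values())

overall = amdahl_speedup(kernels)  # ~132x with these rounded fractions
```

Even though q3m c/finish alone accelerates by 160.84×, the remaining ~1.2% of serial-baseline time caps the combined speed-up well below that figure.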
6.2.1 PMC Cycle Load Balancing
Figure 6.4 presents the kernel timing results per PMC Cycle step, when considering bench-A execut-
ing on the mcx2 heterogeneous platform. The load balancing algorithm introduced in Section 5.2.2 was
used and converged to the grid partitioning depicted in this figure. Figure 6.5 illustrates the time evolution
of the workload balancing. Here, the balancing term r is set to 2000 iterations, in order to avoid under-
sampling the computational weight of the q3m/mm kernels (which depends on the randomly picked MM
molecule). The starting workload distribution of 50%/50% converges to approximately 71%/29% in only
4 balancing steps, favouring the more powerful 780Ti GPU. When this distribution is reached, one can
observe that the execution of the balanced workload in each GPU takes practically the same time, which
means that the load is balanced and that the balancing mechanism has met its purpose. It is worth re-
calling that the employed balancing solution was designed to distribute the workload of the q3m c/finish
kernels (corresponding to the Coulomb QM/MM part in Figure 6.4), although the measurements taken
into account to make the balancing decision include all the other kernels and overheads, since one
wishes to balance each PMC Cycle step as a whole. In order to illustrate how the balancing persists even after the
10k-th step, the chart represents the execution up to 20k steps. When compared with an unbalanced
run (e.g., a fixed 50%/50% workload distribution) on the same platform, the balanced version yields a
speed-up of 1.3×, further justifying the advantage of having incorporated a load balancing solution.
6.2.2 PMC Cycle Scalability
The memory footprint of the PMC Cycle kernels in the OpenCL accelerators is mainly limited by
the program output, pre-generated random lists, the MM lattice and the QM grid. The first two solely
depend on the number of executed steps, and are addressed by having the Host CPU flush the output
and refresh the random lists periodically. The latter two were also not a problem for the selected
benchmarks, since the largest used QM grid and MM lattice occupy ∼48MB and ∼160KB, respectively.
Nevertheless, it is important to note that the scalability of the proposed implementation is not
compromised even when significantly larger simulations are considered. To address such cases, the
following solution is envisaged: the q3m/mm kernels may concurrently execute over one chunk of data
while the Host CPU is transferring the next chunk. This double-buffering mechanism can be achieved
in OpenCL devices by using a second OpenCL command-queue and another Host CPU thread to issue
Figure 6.4: OpenCL kernel timings (per step) for the PMC Cycle running on the mcx2 heterogeneous platform. The load is balanced for the heavier kernels (q3m c/q3m finish, corresponding to Coulomb QM/MM), whereas the lighter kernels were scheduled to the first GPU. The considered benchmark is bench-A, using mixed fp64-fp32 precision.
Figure 6.5: Convergence pattern of the implemented load balancing algorithm (balancing every 2000 steps), for bench-C running on the GTX 780Ti/660Ti platform (mcx2). The presented PMC Cycle time measurements represent mean times since the previous balancing.
the memory transfer operations. Since all the considered QM/MM systems have a memory footprint far
below than the maximum memory available in the considered acceleration platforms, implementing this
double-buffering mechanism was not considered a priority in this dissertation.
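The envisaged double-buffering can be sketched with a helper thread standing in for the second
OpenCL command-queue: the copy of chunk i+1 is issued while the kernels consume chunk i. The
names below are illustrative stand-ins, not part of the actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def transfer(chunk):
    # Stand-in for a host-to-device copy issued on the second command-queue.
    return [float(x) for x in chunk]

def compute(device_chunk):
    # Stand-in for the q3m kernels executing over one resident chunk.
    return sum(device_chunk)

def double_buffered(chunks):
    """Overlap the copy of chunk i+1 with the computation over chunk i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(transfer, chunks[0])      # prime the first buffer
        for i in range(len(chunks)):
            resident = pending.result()                   # wait until chunk i is resident
            if i + 1 < len(chunks):
                pending = copier.submit(transfer, chunks[i + 1])  # asynchronous copy
            results.append(compute(resident))             # kernels run while the copy proceeds
    return results

print(double_buffered([[1, 2], [3, 4], [5, 6]]))  # [3.0, 7.0, 11.0]
```

On a real OpenCL device the same overlap would be obtained by enqueueing the write on the second
command-queue and making each kernel launch wait on the corresponding copy event.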
Other overheads worth discussing are the synchronization events related to scheduling the
computations belonging to a single Markov chain among multiple GPUs. These come with the
additional overhead of synchronizing the PMC Cycle step results among the involved devices at the
end of every step. Fortunately, these overheads do not scale with the simulation size, since the buffers
that need to be synchronized back and forth (between the Host and the accelerators) hold reduced
energy terms, each represented by a single number. For the particular example presented in Figure 6.4,
each device has
to read/write three reduced terms per step, which has a performance impact of a few dozen
microseconds. Conversely, the computational cost of the q3m_c/q3m_finish kernels scales with the
size of the QM grid, meaning that multi-device scalability is better for larger grids (which concern the
most computationally challenging problems).

Figure 6.6: Scalability of the PMC Cycle when changing the size of the QM part in bench-A. Speed-up
results are presented for a dual GTX680 system with respect to a single GTX680 (platform mcx4).
[Plot: the speed-up of adding a second GTX680 grows from roughly 1.56× at 4.5 million grid points to
roughly 1.64× at 8.5 million grid points.]
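The dynamic balancing whose convergence Figure 6.5 depicts can be sketched as a
throughput-proportional update rule: after each balancing interval, each device's share of the workload
is set proportional to its measured throughput. This is a minimal illustration, not the implemented
algorithm; the 2.4:1 device-speed ratio is a hypothetical stand-in for an unequal GPU pair:

```python
def rebalance(shares, times):
    """One balancing step: set each device's share proportional to its
    measured throughput (processed share / elapsed time)."""
    speeds = [s / t for s, t in zip(shares, times)]
    total = sum(speeds)
    return [v / total for v in speeds]

# Idealized run: two devices with a 2.4:1 speed ratio, starting at 50/50.
shares = [0.5, 0.5]
for _ in range(5):
    times = [shares[0] / 2.4, shares[1] / 1.0]   # time grows with share, shrinks with speed
    shares = rebalance(shares, times)
print([round(s, 3) for s in shares])  # [0.706, 0.294]
```

In this idealized model (constant device speeds, no measurement noise) one step already reaches
the fixed point; in practice the measured times fluctuate, which is why the shares in Figure 6.5 settle
gradually over several balancing intervals.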
Figure 6.6 presents the speed-ups obtained for the acceleration of the PMC Cycle kernels when a
second GPU is added to the system (platform mcx4) to balance the same Markov chain, considering
several QM grid sizes. As can be observed, multi-device performance scales better for simulations with greater
QM parts, which actually represents a common characteristic of real QM/MM systems. Considering
the high speed-up results obtained in the q3m_c/q3m_finish kernels, one might expect the execution
time of these kernels to rise slowly with the introduction of larger grids, justifying the slow scaling of the
dual-device speed-up curve presented in Figure 6.6. Nevertheless, it is important to recall that the
execution time of the PMC Cycle kernels (for each Markov chain) was already accelerated to the
microsecond order of magnitude for a single device, and at this level every overhead is noticeable. Therefore,
the observed dual-device speed-up, ranging from ∼1.5× to ∼1.6×, is deemed favorable for grids up
to 8 million points. This speed-up would continue to rise (never exceeding 2×) for larger and more
computationally intensive QM/MM systems.
Finally, it is worth noting that the represented configuration, where a single Markov Chain is run on
multiple accelerators, is particularly useful when #accelerators > #chains, which corresponds to a
typical situation in a many-node heterogeneous cluster. The scalability in a multiple Markov chain
scenario, for the case when #accelerators < #chains, will be discussed in Section 6.3.
6.3 Global PMC Results
To conclude the evaluation of the proposed parallel solution, the execution of the complete PMC
simulation is assessed (including the QM Update and the PMC Cycle stages). For this purpose, a
greater focus will be given to bench-R, corresponding to the longest and most realistic dataset. A
detailed discussion of the chemical aspects of the obtained results is presented in [15], which validates
the results against the work where this dataset was first described [11]. Figure 6.7 depicts the
simulation results, showing the conversion of the chorismate structure into prephenate.
Figure 6.7: QM/MM Simulation box for the bench-R dataset (partial representation), together with the
simulation results for the conversion of the chorismate structure into prephenate.

Table 6.5: bench-R execution time for the PMC Cycle (50k steps) and QM Update (24.8M iters) stages,
as well as for the full PMC simulation. The presented results consider two baselines and four parallel
solutions, with either a single or 8 Markov chains and fp64 or fp64-fp32 precision.

                                     Execution time (s)
Setup                              PMC Cycle  QM Update  Full Simulation
mcx4-baseline                         1883.1       96.4  980038.0 s = 272.23 h
avx2-baseline                          572.7       32.2  300028.0 s =  83.34 h
mcx4 fp64 1-chain                       42.9       96.4   69026.4 s =  19.17 h
mcx4 fp64 8-parallel-chains             42.9      113.4   10156.2 s =   2.82 h
mcx4 fp64-fp32 1-chain                  10.2       96.4   52833.4 s =  14.68 h
mcx4 fp64-fp32 8-parallel-chains        10.2      113.4    7757.7 s =   2.15 h

Table 6.5 presents the execution times for the inner PMC Cycle (comprising 50k steps), the QM Update
and the full PMC application (comprising 496 PMC outer iterations, which yields a total of 24.8M
PMC Cycle steps). This performance study was conducted with the mcx4 platform, since it has the
largest number of CPU cores (8), allowing up to 8 independent Markov Chains to be spawned while
scheduling their respective PMC Cycles on two GTX680 GPUs. It is worth noting the chemical
relevance of spawning multiple Markov chains, as it allows a better coverage of the chemical solution
space, thus improving the quality of the results. The execution times were measured for a single and for 8 Markov
chains, as well as for two reference versions: the avx2-baseline and one single core of mcx4 (henceforth
referred to as mcx4-baseline). Both reference timings are presented to avoid mischaracterizing the
attained parallelization quality. In fact, although the reference with the best performance is the
avx2-baseline, directly comparing the parallel solution on mcx4 with an AVX2-enabled core of the Intel
i7-4770K CPU would be unfair, because the latter runs the QM Update roughly 3× faster (32.2 s versus
96.4 s), diluting the performance gains in the OpenCL-accelerated part (the PMC Cycle). Nevertheless,
considerable speed-up gains are achieved even when comparing to the faster avx2-baseline reference.
Table 6.6 presents
the corresponding speed-up results versus both baselines (using the timings introduced in Table 6.5).
Considering all these run configurations, the execution time of the full bench-R simulation for the two
reference scenarios corresponds to 272.23 h (hours) and 83.34 h, respectively. The parallel solutions
reduce these execution times considerably, to values ranging from 19.17 h down to 2.15 h, depending
on the number of spawned Markov chains (either single or 8 chains) and the chosen numerical
precision. For the single-chain
case, the obtained speed-up is mainly due to the OpenCL acceleration of the PMC Cycle. As shown
Table 6.6: Performance speed-ups for bench-R, considering the execution times presented in Table 6.5.
                                   Speed-up versus mcx4-baseline
Setup                              PMC Cycle  Full Simulation
mcx4 fp64 1-chain                     43.80×          14.20×
mcx4 fp64 8-parallel-chains           43.80×          96.50×
mcx4 fp64-fp32 1-chain               184.23×          18.55×
mcx4 fp64-fp32 8-parallel-chains     184.23×         126.33×

                                   Speed-up versus avx2-baseline
Setup                              PMC Cycle  Full Simulation
mcx4 fp64 1-chain                     13.32×           4.35×
mcx4 fp64 8-parallel-chains           13.32×          29.55×
mcx4 fp64-fp32 1-chain                56.02×           5.67×
mcx4 fp64-fp32 8-parallel-chains      56.02×          38.68×
in Table 6.6, a speed-up of up to 184.23× is obtained in the PMC Cycle alone. However, this speed-up
is bounded by Amdahl's law, due to the QM Update fraction running on the CPU. In fact, by looking at
the mcx4-baseline reference scenario, one can observe that in the original run the PMC Cycle
represented 1883.1/(1883.1 + 96.4) = 95.13% of each PMC iteration (PMC Cycle + QM Update).
Hence, the speed-up of 18.55× presented in Table 6.6 was expected, since the speed-up of 184.23×
obtained in the PMC Cycle (fp64-fp32 version) would at most yield 1/(0.9513/184.23 + 0.0487) ≈ 18.57×
global speed-up. Therefore,
one can observe that the single-chain runs are limited by the QM Update fraction (4.87% in the mcx4-
baseline scenario), which uses the MOLPRO closed source program package, a necessary tool in the
current approach to the involved QM chemical calculations [14]. To tackle this limitation, the multiple
Markov chain approach was devised in this dissertation.
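The Amdahl bound invoked above can be checked numerically (a minimal sketch; the input figures are
those reported in Tables 6.5 and 6.6):

```python
def amdahl(parallel_fraction, speedup):
    """Overall speed-up when only a fraction of the runtime is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (parallel_fraction / speedup + serial_fraction)

# PMC Cycle share of a PMC iteration in the mcx4-baseline run (Table 6.5),
# combined with the 184.23x fp64-fp32 PMC Cycle speed-up (Table 6.6).
f = 1883.1 / (1883.1 + 96.4)   # ~0.9513
print(round(amdahl(f, 184.23), 2))  # 18.57
```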
For the multiple Markov chain case, the attained speed-up is mainly due to parallel MC state-space
exploration. In fact, by comparing the single with multiple chain speed-up values for the same precision
approach, a scalable speed-up trend can be observed from the obtained results. For example, by
comparing the speed-ups attained in the mcx4 fp64-fp32 for the cases of 1 and 8 chains, a speed-up
ratio of 126.33/18.55 = 6.81× is obtained. It is important to recall that the speed-up attainable by adding more
chains is limited by Equation 4.1, by the Host-side thread management, and by the overhead introduced
by concurrent memory and disk accesses issued by the CPU cores running the QM Updates in parallel.
In this case, one can verify from Table 6.5 that the mean QM Update execution time has degraded from
96.4 s to 113.4 s. Furthermore, although Equation 4.1 would yield a theoretical maximum of 23 chains
for this particular case, the 8 cores available in mcx4 limit the maximum number of chains one can run
on that platform to 8. Hence, the considered multiple Markov Chain solution achieves an efficiency of
38.68/(5.67 × 8 cores) ≈ 85%. Nevertheless, one can conclude that by using the
same GPUs for the PMC Cycle acceleration, the proposed implementation would scale well to integrate
a system with up to 23 CPU cores. Increasing the number of OpenCL accelerators would increase this
limit even further.
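The quoted efficiency follows directly from the tabulated speed-ups and can be reproduced in one line:

```python
def chain_efficiency(multi_chain_speedup, single_chain_speedup, n_cores):
    """Parallel efficiency of running one chain per core versus a single chain."""
    return multi_chain_speedup / (single_chain_speedup * n_cores)

# 8 chains on 8 cores, fp64-fp32 precision, versus the avx2-baseline (Table 6.6).
print(round(chain_efficiency(38.68, 5.67, 8), 3))  # 0.853
```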
Among the presented results, the most conservative speed-up of this parallel implementation is
assumed to be 38.68× (mcx4 fp64-fp32 8-chain versus avx2-baseline; see Table 6.6), as the
avx2-baseline corresponds to the reference with the best performance. Naturally, the 126.33×
speed-up obtained when comparing mcx4 to itself could remain close to this value if a better Intel Xeon
CPU had been used in
both the reference and the parallel solutions.
6.4 Numerical Evaluation: Convergence Accuracy and Energy Consumption
While the proposed parallel implementation does not make any approximation or relaxation with
respect to the original sequential method, yielding exactly the same output as the original PMC
implementation, it is important to consider different numerical precisions, and to evaluate how they
impact execution performance, energy consumption and the quality of the results. Accordingly, besides
the original 64-bit floating-point representation (fp64), the presented OpenCL version offers the
following numerical representation alternatives: i) mixed 64-bit and 32-bit floating-point (fp64-fp32), or
ii) mixed 64-bit/32-bit floating-point and 32-bit fixed-point (fp32-i32). In the fp64-fp32 version, the
computationally more complex
q3m_finish/mm_finish kernels use 32-bit floating-point precision for the ∆E^{C,grid}_{QM/MM} energy
computations, whereas 64-bit floating-point is employed for the remaining energy terms, whose
computations are much faster. This configuration also assumes the same data-type to store the grid,
as well as a copy of the lattice and the chmol. Likewise, the fp32-i32 version uses 32-bit floating-point
representations for the same energy computation, but it uses 32-bit fixed-point for the squared
distances. The latter operates on normalized grid and atom coordinates, represented by 32-bit integers,
which actually provides a higher precision than the alternative 32-bit floating-point. The usage of mixed
precision for different energy terms calls for casting operations, which may degrade the performance in
a GPU accelerator. To circumvent this degradation, all the necessary casting operations were moved to
the monte_carlo and decide kernels, concentrating the necessary conversions in the single-threaded
procedures of these kernels and avoiding redundant casting in the many-thread kernels.
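The precision advantage of 32-bit fixed point over 32-bit floating point for squared distances can be
illustrated with a small emulation. The scaling convention below (normalized coordinates mapped onto
the full signed 32-bit range, products accumulated in a wider integer) is an assumption for illustration,
not the exact kernel code:

```python
SCALE = 2**31 - 1  # map normalized coordinates in [0, 1) onto the signed 32-bit range

def to_fixed(x):
    # Quantize a normalized coordinate to 32-bit fixed point
    # (31 significant bits, versus the 24-bit mantissa of fp32).
    return int(round(x * SCALE))

def sq_dist_fixed(a, b):
    """Squared distance computed on fixed-point coordinates.

    Coordinate differences fit in 32 bits; their products are accumulated
    in a wider (64-bit) integer, as the GPU kernel would have to do."""
    acc = 0
    for ai, bi in zip(a, b):
        d = to_fixed(ai) - to_fixed(bi)
        acc += d * d
    return acc / (SCALE * SCALE)  # rescale back to normalized units

p = (0.125, 0.250, 0.500)
q = (0.100, 0.300, 0.450)
exact = sum((x - y) ** 2 for x, y in zip(p, q))   # 0.005625
print(abs(sq_dist_fixed(p, q) - exact) < 1e-9)    # True
```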
The resulting performance gains, for each of the considered precisions when executing the q3m_c
kernel, are presented in Table 6.7. Depending on the adopted GPU device, execution speed-ups as
high as 8.89× can be attained by simply adopting lower precisions, with minor degradations of the
obtained energy results. Moreover, the generated system configurations remain the same as long as
the accumulated error does not cause the sequence of selected systems to diverge, which was verified
to be the case for all the considered benchmarks. In Table 6.8, the error introduced in the
∆E^{C,grid}_{QM/MM} term and in the total system energy (E) is presented for each kernel version,
with respect to the fp64 implementation (which is numerically equivalent to the original serial version).
It can be observed that the fp32-i32 version offers higher precision than the fp64-fp32 one, due to the
greater number of significant bits used for the squared distance operations. In these simulations, a
maximum admissible error of em = 1.0 × 10^-1 kJ/mol was enforced, as commonly considered in this
research domain.
Table 6.7: Speed-up of the mixed precision q3m_c kernel versions versus the original fp64 version,
running on the same machine, for the case of bench-A.

             q3m_c speed-up (vs fp64)
version     GTX680  GTX780Ti   K20C
fp64-fp32    8.56×     7.39×  2.65×
fp32-i32     8.89×     7.44×  2.74×

Table 6.8: Obtained numerical precision. The error is shown for the ∆E^C_QM/MM energy term, as well
as for the total energy of the system (E), when considering the em = 1.0 × 10^-1 kJ/mol maximum error.
The average values were taken from the complete set of generated QM/MM systems, by using bench-A.

                          Error vs fp64
                    ∆E^C_QM/MM (kJ/mol)          E (kJ/mol)
version    measurement   value      % of em   value      % of em
fp64-fp32  mean          6.4×10^-5  0.064     4.2×10^-3  4.2
           max           2.9×10^-3  2.9       1.6×10^-2  16
fp32-i32   mean          1.6×10^-5  0.016     9.0×10^-4  0.9
           max           1.1×10^-3  1.1       1.6×10^-2  16

In order to further assess the impact of the considered mixed precision solutions, the average energy
consumption was measured on the Nvidia K20C GPU, by using the NVML library. The method
introduced in [8] was used to gather the attained power measurements, at the maximum allowed
sampling frequency of 66.7 Hz. Since this frequency is too low to sample one kernel launch of q3m_c
(which executes in the order of hundreds of microseconds), a testbench with just the q3m_c kernel was
built and launched repeatedly for 100k steps.
The obtained results are presented in Table 6.9. The first aspect worth noting refers to the
configuration that presented the highest average power: fp64-fp32. This fact can be justified by the
higher core occupancy allowed by the single-precision floating-point implementation. The fp64 version
has a lower average power dissipation for the opposite reason, i.e., its lower GPU occupancy results in
a reduced dynamic power requirement. For the fp32-i32 configuration, a GPU occupancy similar to that
of fp64-fp32 is expected, although the integer functional units consume less power, resulting in an 8 W
decrease in average power. To complement and further justify these observations, power and energy
consumption were also measured on the avx2-baseline configuration, by using the SchedMon power
and energy measurement tool [50]. Although the avx2-baseline draws (on average) approximately 4
times less power than the most energy-efficient parallel configuration on the K20C GPU (fp32-i32), the
acceleration attained by the GPU in the execution time of the q3m_c kernel greatly compensates for
this, yielding a much lower overall energy consumption and saving up to 28.8× energy. Although the
same tests could not be performed on the GTX680 and GTX780Ti GPUs (these GPUs do not feature
internal power counters), rather similar energy savings can be predicted for the GTX780Ti accelerator,
since it shares the same Kepler core architecture (GK110) as the K20C.
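The reported savings follow from E = P × t; the 28.8× figure can be reproduced directly from the
Table 6.9 measurements:

```python
def energy_savings(p_base_w, t_base_us, p_acc_w, t_acc_us):
    """Energy ratio between a baseline and an accelerated run (E = P × t)."""
    return (p_base_w * t_base_us) / (p_acc_w * t_acc_us)

# avx2-baseline (34 W, 76077 us per q3m_c) versus K20C fp32-i32 (139 W, 647 us).
print(round(energy_savings(34, 76077, 139, 647), 1))  # 28.8
```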
6.5 Summary
In this Chapter, a detailed performance assessment of the devised parallel heterogeneous approach
to the PMC QM/MM method was presented and discussed. The main performance metric of interest
was the execution time speed-up, when comparing the parallel solutions to either the avx2-baseline
(a single core of the i7-4770K processor, with AVX2 instructions enabled) or the mcx4-baseline
(a single core of Xeon E5-2609). To accomplish these profiling measurements, several tools were
employed, namely the PAPI [35] library, OpenCL Profiling Events, and the kcachegrind tool (based on
valgrind [37]).

Table 6.9: Execution time speed-up, energy savings and average power consumption, when comparing
the Tesla K20C GPU running all the devised numerical precision approaches with the avx2-baseline
(with the original fp64 precision). The testbench was run on the K20C GPU for 100k steps, in order to
ensure a representative sampling of the computational cost of q3m_c. The default core frequency
configuration was used for all experiments.

Setup                q3m_c time (µs)  q3m_c speed-up  Energy savings  Avg. Power
avx2-baseline fp64             76077              1×              1×        34 W
K20C fp64                       1775           42.9×           10.4×       140 W
K20C fp64-fp32                   670          113.5×           26.3×       147 W
K20C fp32-i32                    647          117.6×           28.8×       139 W

The performance of the parallel solution was assessed by using four sets of chemical
datasets, carefully designed by chemical experts from the Institut für Physikalische Chemie,
Georg-August-Universität Göttingen. In particular, a chorismate reaction dataset, relevant to the field of
application [11, 15], was benchmarked. The experiments for this particular dataset yielded a 56×
execution time speed-up in the simulation bottleneck (PMC Cycle), and a 38× speed-up for the full
simulation (when compared to the avx2-baseline). This is a significant acceleration, since it reduced the
full execution time from ∼80 hours to ∼2 hours. Furthermore, a scalability of 85% was observed for the
case of 8 Markov chains executing on a platform with 8 CPU cores and 2 GPUs. Finally, the numerical
quality and energy consumption of the proposed solution were evaluated (by using the SchedMon
power and energy measurement tool [50]) for alternative numerical representation schemes. Energy
savings of up to 28× were observed in the heaviest kernel of the simulation bottleneck.
CHAPTER 7
CONCLUSIONS
The objective of this MSc thesis was to accelerate the PMC QM/MM algorithm by designing an
efficient and scalable parallel implementation for heterogeneous architectures comprising a multi-core
CPU and one or more accelerators (e.g., GPUs). In particular, the performance of the devised solution
was to be evaluated in several system configurations, by studying molecular simulations relevant to
the Theoretical Chemistry field of application. The major metric of interest was the obtained speed-up
with respect to the original serial version, although the consumed energy and the resulting numerical
precision were also a target of analysis and discussion. OpenCL was chosen as the parallel framework,
to allow targeting heterogeneous architectures.
Before parallelizing the target application, an optimized single-core version was developed, in order
to establish a fair performance baseline. In this respect, optimized data structures and other preliminary
optimization schemes were employed. Two main procedures were identified: the QM Update and the
PMC Cycle. Then, a careful study of the available parallelization opportunities was made, eventually
leading to a multi-layered parallel solution, extracting parallelism by: i) running several independent
QM Updates, each corresponding to a Markov chain (chain-level parallelism); ii) executing the PMC
Cycle procedures in parallel with respect to each other (task-level parallelism); iii) executing the inner
iterations of each procedure in parallel, over different sections of the dataset (data-level parallelism).
Considering this approach, the compute devices in the target heterogeneous node architecture were
tasked with different parts of the problem, scheduling the intrinsically serial QM Update processes to
the CPU cores (one instance per core), and the highly task- and data-parallel PMC Cycle to the
available OpenCL accelerators. Concurrently with the accelerators, the host CPU ensures dynamic
load balancing, by distributing the workload of the heaviest kernels among multiple accelerators.
Subsequently, a detailed performance assessment of the devised parallel heterogeneous approach to
the PMC QM/MM method was presented and discussed. By exploiting the massively parallel GPU
architecture, the computational bottleneck of the original single-core approach was accelerated by
56×, for the case of a well-known chorismate dataset. To further promote the scalability of the proposed
implementation, the MC state-space was further sampled using several independent Markov Chains,
which was shown to scale with an efficiency of 85%. In a cumulative perspective, the complete PMC
simulation yielded a speed-up of 38×, effectively reducing the full execution time of the chorismate
QM/MM simulation from ∼80 hours to ∼2 hours, and achieving considerable savings in terms of time
and energy. Other chemical benchmarks were also evaluated, to assess the particular performance of
the PMC Cycle. For the case of a typical arginine dataset, a speed-up of up to 152× was achieved
when running the PMC Cycle on two heterogeneous Nvidia GPUs, and up to 111× when using a
heterogeneous system composed of an AMD GPU and an Nvidia GPU.
In conclusion, the objectives proposed for this dissertation were met. The cumulative contributions
of this thesis to the scientific community have resulted in two research articles. One has already been
submitted for publication in an international peer-reviewed journal [33], whereas the other is awaiting
submission [15]. In addition, the resulting application is now being actively used by the Free Floater
Research Group - Computational Chemistry and Biochemistry, Institut für Physikalische Chemie,
Georg-August-Universität Göttingen, for further scientific studies. The resulting parallel program
package will be released under the BSD-3-Clause open source licence.
7.1 Future Work
The present dissertation was naturally limited by the time available to develop an MSc thesis.
Therefore, a few optimizations and additional parallel schemes were not considered, either due to lack
of time or to deviation from the thesis scope. In this respect, the following future work is proposed:
(i) In order to exploit multiple nodes in a computing network, an MPI [26] solution could be devised,
to allow running a larger number of Markov chains. Since fairly good scalability was attained in
a multi-core CPU environment, and since almost no communication is required between Markov
chains, this approach could achieve good results.
(ii) An approach to simultaneously exploit the FPGA architecture and the CPU and GPU architectures
could be designed. Since OpenCL is now supported by Altera FPGAs, this would pose an interest-
ing scenario. This approach was not considered a priority in this dissertation, since one of the main
objectives was to optimize the developed solution to efficiently run in accelerators commonly found
in computational chemistry research groups (e.g. CPUs and GPUs).
(iii) The energy measurement approach herein considered could be integrated in the load balancing
algorithm, in order to achieve an energy-aware balancing solution. This approach was discarded
for the case of this dissertation, since among the selected hardware platforms, the only GPU able
to perform energy measurements is the Nvidia Tesla K20C.
(iv) Study of a possible grid partitioning approach, such as the neighbor-list or the cell-list schemes
discussed in [4, 49, 56]. However, since in the PMC QM/MM algorithm the QM grid has to be
updated every PMC (outer) iteration (by the QM Update), it is not obvious whether a cell-list or
neighbor-list approach would bring additional performance gains, as the overhead of creating
these lists would be incurred every PMC iteration.
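The chain distribution of item (i) is embarrassingly parallel; a hypothetical round-robin assignment of
chains to MPI ranks (the helper below is illustrative, not part of any existing code) shows how little
coordination it requires:

```python
def assign_chains(n_chains, n_ranks):
    """Round-robin assignment of independent Markov chains to MPI ranks.

    Since the chains exchange no data during the run, each rank only needs
    its own chain list; results are gathered once at the end."""
    return {rank: [c for c in range(n_chains) if c % n_ranks == rank]
            for rank in range(n_ranks)}

print(assign_chains(8, 3))  # {0: [0, 3, 6], 1: [1, 4, 7], 2: [2, 5]}
```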
REFERENCES
[1] Alejandro Acosta, Robert Corujo, Vicente Blanco, and Francisco Almeida. Dynamic load balancing
on heterogeneous multicore/multigpu systems. In High Performance Computing and Simulation
(HPCS), 2010 International Conference on, pages 467–476. IEEE, 2010.
[2] Amos G Anderson, William A Goddard III, and Peter Schroder. Quantum monte carlo on graphical
processing units. Computer Physics Communications, 177(3):298–306, 2007.
[3] Joshua A. Anderson, Eric Jankowski, Thomas L. Grubb, Michael Engel, and Sharon C. Glotzer.
Massively parallel monte carlo for many-particle simulations on GPUs. Journal of Computational
Physics, 254:27–38, December 2013.
[4] Joshua A Anderson, Chris D Lorenz, and Alex Travesset. General purpose molecular dynamics
simulations fully implemented on graphics processing units. Journal of Computational Physics,
227(10):5342–5359, 2008.
[5] Cedric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-Andre Wacrenier. StarPU: a
unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and
Computation: Practice and Experience, 23(2):187–198, 2011.
[6] Bernd A Berg and A Billoire. Markov chain monte carlo simulations. World Scient., Singapore,
2004.
[7] Alecio Pedro Delazari Binotto, Carlos Eduardo Pereira, Arjan Kuijper, Andre Stork, and Dieter W
Fellner. An effective dynamic scheduling runtime and tuning system for heterogeneous multi and
many-core desktop platforms. In High Performance Computing and Communications (HPCC), 2011
IEEE 13th International Conference on, pages 78–85. IEEE, 2011.
[8] Martin Burtscher, Ivan Zecena, and Ziliang Zong. Measuring gpu power with the k20 built-in sensor.
In Proceedings of Workshop on General Purpose Processing Using GPUs, page 28. ACM, 2014.
[9] Ricolindo L Carino and Ioana Banicescu. Dynamic load balancing with adaptive factoring methods
in scientific applications. The Journal of Supercomputing, 44(1):41–63, 2008.
[10] Long Chen, Oreste Villa, Sriram Krishnamoorthy, and Guang R Gao. Dynamic load balanc-
ing on single-and multi-gpu systems. In Parallel & Distributed Processing (IPDPS), 2010 IEEE
International Symposium on, pages 1–12. IEEE, 2010.
[11] Frederik Claeyssens, Kara E Ranaghan, Narin Lawan, Stephen J Macrae, Frederick R Manby,
Jeremy N Harvey, and Adrian J Mulholland. Analysis of chorismate mutase catalysis by qm/mm
modelling of enzyme-catalysed and uncatalysed reactions. Organic & biomolecular chemistry,
9(5):1578–1590, 2011.
[12] David Clarke, Alexey Lastovetsky, and Vladimir Rychkov. Dynamic load balancing of parallel com-
putational iterative routines on highly heterogeneous hpc platforms. Parallel Processing Letters,
21(02):195–217, 2011.
[13] Kenneth P Esler, Jeongnim Kim, David M Ceperley, and Luke Shulenburger. Accelerating quantum
monte carlo simulations of real materials on gpu clusters. Computing in Science & Engineering,
14(1):40–51, 2012.
[14] Jonas Feldt. Entwicklung einer störungstheoretischen QM/MM Monte Carlo Methode für die Studie
von Molekülen in Lösung. Master's thesis, Georg-August-Universität Göttingen, 2013.
[15] Jonas Feldt, Sebastiao Miranda, Joao C. A. Oliveira, Frederico Pratas, Nuno Roma, Pedro Tomas,
and Ricardo A. Mata. Perturbative monte carlo mixed quantum mechanics/molecular mechanics.
Journal of Chemical Information and Modeling (to be submitted).
[16] Christopher J. Fennell and J. Daniel Gezelter. Is the ewald summation still necessary? pairwise al-
ternatives to the accepted standard for long-range electrostatics. The Journal of Chemical Physics,
124(23):234104, 2006.
[17] Mark S. Friedrichs, Peter Eastman, Vishal Vaidyanathan, Mike Houston, Scott Legrand, Adam L.
Beberg, Daniel L. Ensign, Christopher M. Bruns, and Vijay S. Pande. Accelerating molec-
ular dynamic simulation on graphics processing units. Journal of Computational Chemistry,
30(6):864–872, 2009.
[18] George D Geromichalos. Importance of molecular computer modeling in anticancer drug develop-
ment. Journal of BUON: Official Journal of the Balkan Union of Oncology, 12:S101, 2007.
[19] Charles J Geyer. Practical markov chain monte carlo. Statistical Science, pages 473–483, 1992.
[20] Walter R Gilks, Sylvia Richardson, and David J Spiegelhalter. Introducing markov chain monte
carlo. In Markov chain Monte Carlo in practice, pages 1–19. Springer, 1996.
[21] Valentin Gogonea, Lance M Westerhoff, and Kenneth M Merz Jr. Quantum mechanical/quantum
mechanical methods. i. a divide and conquer strategy for solving the schrodinger equation for large
molecular systems using a composite density functional–semiempirical hamiltonian. The Journal
of Chemical Physics, 113(14):5604–5613, 2000.
[22] Khronos OpenCL Working Group. The OpenCL Specification version 1.2 revision 19, 2012.
[23] Clifford Hall, Weixiao Ji, and Estela Blaisten-Barojas. The metropolis monte carlo method with
CUDA enabled graphic processing units. Journal of Computational Physics, 258:871–879, Febru-
ary 2014.
[24] Intel. Intel SDK for OpenCL* Applications 2013 R2 Optimization Guide, pages 14-15, 2013.
[25] William L. Jorgensen, David S. Maxwell, and Julian Tirado-Rives. Development and testing of the
OPLS all-atom force field on conformational energetics and properties of organic liquids. Journal
of the American Chemical Society, 118(45):11225–11236, January 1996.
[26] Mario Lauria and Andrew Chien. Mpi-fm: High performance mpi on workstation clusters. Journal
of Parallel and Distributed Computing, 40(1):4–18, 1997.
[27] V.I. Lebedev. Values of the nodes and weights of ninth to seventeenth order gauss-markov
quadrature formulae invariant under the octahedron group with inversion. USSR Computational
Mathematics and Mathematical Physics, 15(1):44–51, January 1975.
[28] Arnaud Legrand, Helene Renard, Yves Robert, and Frederic Vivien. Mapping and load-balancing
iterative computations. Parallel and Distributed Systems, IEEE Transactions on, 15(6):546–558,
2004.
[29] Cong Liu, Jian Li, Wei Huang, Juan Rubio, Evan Speight, and Xiaozhu Lin. Power-efficient time-
sensitive mapping in heterogeneous systems. In Proceedings of the 21st international conference
on Parallel architectures and compilation techniques, pages 23–32. ACM, 2012.
[30] Y Lutsyshyn. Fast quantum monte carlo on a gpu. arXiv preprint arXiv:1312.1282, 2013.
[31] Tadaaki Mashimo, Yoshifumi Fukunishi, Narutoshi Kamiya, Yu Takano, Ikuo Fukuda, and Haruki
Nakamura. Molecular dynamics simulations accelerated by GPU for biological macromolecules with
a non-ewald scheme for electrostatic interactions. Journal of Chemical Theory and Computation,
9(12):5599–5609, December 2013.
[32] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward
Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics,
21:1087, 1953.
[33] Sebastiao Miranda, Jonas Feldt, Frederico Pratas, Ricardo Mata, Nuno Roma, and Pedro Tomas. A
parallel heterogeneous approach to perturbative monte carlo qm/mm simulations. Journal of High
Performance Computing Applications (submitted).
[34] Lubos Mitas. Diffusion monte carlo. Quantum Monte Carlo Methods in Physics and Chemistry,
525:247, 1998.
[35] Philip J Mucci, Shirley Browne, Christine Deane, and George Ho. Papi: A portable interface to
hardware performance counters. In Proceedings of the Department of Defense HPCMP Users
Group Conference, pages 7–10, 1999.
[36] Michael E. Mura and Peter J. Knowles. Improved radial grids for quadrature in molecular density-
functional calculations. The Journal of Chemical Physics, 104(24):9848–9858, June 1996.
[37] Nicholas Nethercote and Julian Seward. Valgrind: A program supervision framework. Electronic
notes in theoretical computer science, 89(2):44–66, 2003.
[38] Matías A Nitsche, Manuel Ferreria, Esteban E Mocskos, and Mariano C Gonzalez Lebrero. GPU accelerated implementation of density functional theory for hybrid QM/MM simulations. Journal of Chemical Theory and Computation, 10(3):959–967, 2014.
[39] NVIDIA. Kepler GK110, version 1.0, 2012.
[40] NVIDIA. CUDA C Programming Guide, version v5.5, 2013.
[41] NVIDIA. CUDA GPU Occupancy Calculator. CUDA SDK, 2010.
[42] Robert E Overman, Jan F Prins, Laura A Miller, and Michael L Minion. Dynamic load balanc-
ing of the adaptive fast multipole method in heterogeneous systems. In Parallel and Distributed
Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, pages
1126–1135. IEEE, 2013.
[43] John P. Perdew, Kieron Burke, and Matthias Ernzerhof. Generalized gradient approximation made
simple. Physical Review Letters, 77(18):3865–3868, October 1996.
[44] Frederico Pratas, Leonel Sousa, Johannes M. Dieterich, and Ricardo A. Mata. Computation of in-
duced dipoles in molecular mechanics simulations using graphics processors. Journal of Chemical
Information and Modeling, 52(5):1159–1166, May 2012.
[45] Giuseppe Scarpa, Raffaele Gaetano, Michal Haindl, and Josiane Zerubia. Hierarchical multiple Markov chain model for unsupervised texture segmentation. IEEE Transactions on Image Processing, 18(8):1830–1843, 2009.
[46] Desh Singh, Tom Czajkowski, and Andrew Ling. Tutorial: Harnessing the Power of FPGAs using Altera's OpenCL Compiler. Altera Corporation, 2013.
[47] Kyle Spafford, Jeremy Meredith, and Jeffrey Vetter. Maestro: Data orchestration and tuning for OpenCL devices. In Euro-Par 2010: Parallel Processing, pages 275–286. Springer, 2010.
[48] John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy, Leonardo G. Trabuco, and
Klaus Schulten. Accelerating molecular modeling applications with graphics processors. Journal of
Computational Chemistry, 28(16):2618–2640, 2007.
[49] Alfeus Sunarso, Tomohiro Tsuji, and Shigeomi Chono. GPU-accelerated molecular dynamics sim-
ulation for study of liquid crystalline flows. Journal of Computational Physics, 229(15):5486–5497,
2010.
[50] Luis Tanica, Aleksandar Ilic, Pedro Tomas, and Leonel Sousa. SchedMon: A performance and energy monitoring tool for modern multi-cores. In 7th International Workshop on Multi-/Many-Core Computing Systems (MuCoCoS'2014), 2014.
[51] Thanh N. Truong and Eugene V. Stefanovich. Development of a perturbative approach for Monte Carlo simulations using a hybrid ab initio QM/MM method. Chemical Physics Letters, 256(3):348–352, June 1996.
[52] Yutaka Uejima, Tomoharu Terashima, and Ryo Maezono. Acceleration of a QM/MM-QMC simulation using GPU. Journal of Computational Chemistry, 32(10):2264–2272, 2011.
[53] Bart Verleye, Pierre Henri, Roel Wuyts, Giovanni Lapenta, and Karl Meerbergen. Implementation of a 2D electrostatic particle-in-cell algorithm in Unified Parallel C with dynamic load-balancing. Computers & Fluids, 80:10–16, 2013.
[54] Florian Weigend and Reinhold Ahlrichs. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy. Physical Chemistry Chemical Physics, 7(18):3297–3305, August 2005.
[55] H.-J. Werner, P. J. Knowles, G. Knizia, F. R. Manby, M. Schutz, et al. MOLPRO, version 2012.1, a package of ab initio programs, 2012. See www.molpro.net.
[56] Zhenhua Yao, Jian-Sheng Wang, Gui-Rong Liu, and Min Cheng. Improved neighbor list algorithm
in molecular simulations using cell decomposition and data sorting method. Computer Physics
Communications, 161(1):27–35, 2004.
[57] Weihang Zhu and Yaohang Li. GPU-accelerated differential evolutionary Markov chain Monte Carlo method for multi-objective optimization over continuous space. In Proceedings of the 2nd Workshop on Bio-inspired Algorithms for Distributed Systems, pages 1–8. ACM, 2010.