Parallelization Strategies for Implementing Nbody Codes on Multicore Architectures
Pontifical Catholic University of Rio Grande do Sul – PUCRS
Graduate Program in Computer Science
Faculty of Informatics
Parallelization Strategies for N-Body Simulations on Multicore Architectures
Filipo Novo Mór
Thais Christina Webber dos Santos
César Augusto Missio Marcon
GPU Implementation
[Figure: three panels showing tasks 0 to 14 moving between Global Memory and the Shared Memory Bank. Labels: Active Tasks, Active Transfers, Barrier. Panel captions: data buffering in shared memory; consumption of buffered data by parallel threads; at the end, global memory is updated.]
Initially, the particle data is copied from RAM to GPU memory.
The code then runs as a pipeline: data is repeatedly transferred
between global and shared memory while the data already buffered
in shared memory is consumed. At the end of the process, the
remaining data, now updated, is copied back to global memory
and then back to RAM on the CPU side.
The computational cost is given by (1), where C is the cost
of the force calculation function; n is the number of particles;
p is the number of parallel running threads; M is the cost of
data transfer between shared and global memory on the GPU; and
T is the cost of data transfer between RAM and GPU memory.
$$C\,\frac{n^2}{p} + 2Mn + 2Tn \qquad (1)$$
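To get a feel for how the terms of (1) trade off, the expression can be evaluated directly. The sketch below is a literal transcription of (1) into C; the function names `gpu_cost` and `gpu_transfer_share` are hypothetical, and all inputs are in whatever abstract cost unit the reader chooses.

```c
#include <assert.h>

/* Cost expression (1): C*n^2/p + 2*M*n + 2*T*n. */
double gpu_cost(double C, double n, double p, double M, double T)
{
    assert(p > 0.0);
    return C * n * n / p + 2.0 * M * n + 2.0 * T * n;
}

/* Fraction of the total cost in (1) that is spent on memory transfers. */
double gpu_transfer_share(double C, double n, double p, double M, double T)
{
    double total = gpu_cost(C, n, p, M, T);
    assert(total > 0.0);
    return (2.0 * M * n + 2.0 * T * n) / total;
}
```

For example, with C = M = T = 1, n = 100 and p = 10, the compute term is 1000 and the two transfer terms are 200 each. Because the transfer terms grow linearly in n while the compute term grows quadratically, the transfer share shrinks as n increases for fixed p.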
Multicore CPU Implementation
For a multicore CPU, a standard serial code was parallelized
by adding OpenMP directives directly to it. The computational
cost was reduced from $n^2$ to (2), since no memory transfers
are needed. p is the number of parallel threads, which is
normally one per CPU core.
$$C\,\frac{n^2}{p} \qquad (2)$$
The N-Body Simulation
The Particle-Particle method does not truncate or approximate
the force summation, so its accuracy is limited only by
machine precision.
Energy is monitored during the simulation.
Computational cost: $O(n^2)$.
Partial Results and Perspectives
Cost savings achieved through real speedup.
Visualization module already implemented.
Next Steps:
Cluster implementation (CUDA + MPI + GPUDirect).
Exploration of hierarchical algorithms (such as
Barnes-Hut and mesh-based methods).
OpenCL version.
OpenACC version.
Execution cost as quoted on the Amazon EC2 service:
Version   Cost (USD)
Serial    0.33
OpenMP    0.08
CUDA      0.05