Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln
![Page 1: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/1.jpg)
Scheduling Many-Body Short Range MD Simulations on a Cluster of Workstations and Custom VLSI Hardware
Sumanth J.V, David R. Swanson and Hong Jiang
University of Nebraska-Lincoln
We thank ONR, RCF and SDI (NSF 0091900) for funding this research, and Dr. Kenji Yasuoka and Dr. Takahiro Koishi for the many useful talks we had.
![Page 2: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/2.jpg)
Introduction
MD is very computationally intensive.
Fortunately, MD parallelizes well and can therefore be implemented efficiently on a cluster.
Custom VLSI solutions like the MD-GRAPE 2 are another approach, but they are more limited in the kinds of potential functions that can be evaluated.
We combine the two techniques to gain the advantages of both.
![Page 3: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/3.jpg)
Computational Aspects of MD
Perform time integration of the following equation
Forces are computed as
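A standard way to write these two relations, assuming atoms of mass $m_i$ at positions $\mathbf{r}_i$ interacting through a potential energy $V$ (the symbols are assumptions, not taken from the slide), is:

```latex
m_i \frac{d^2 \mathbf{r}_i}{dt^2} = \mathbf{F}_i,
\qquad
\mathbf{F}_i = -\nabla_{\mathbf{r}_i} V(\mathbf{r}_1, \dots, \mathbf{r}_N)
```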
![Page 4: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/4.jpg)
Potential Function
We restrict ourselves to 2-body and 3-body potentials.
The MD-GRAPE 2 is designed to compute only 2-body potentials.
The cluster can however be programmed to evaluate any kind of potential.
We use a combination of the cluster and the MD-GRAPE 2 board to evaluate a 3-body potential.
![Page 5: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/5.jpg)
Lennard-Jones (LJ) Potential
A very simple empirical 2-body potential.
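As an illustration, the common 12-6 form of the LJ potential can be sketched as below. The parameter names `epsilon` (well depth) and `sigma` (zero-crossing distance) and the reduced-unit defaults are assumptions made for this example.

```python
def lj_potential(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones pair energy at separation r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_force_magnitude(r, epsilon=1.0, sigma=1.0):
    """Radial force -dV/dr; positive values are repulsive."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


# The energy crosses zero at r = sigma and has its minimum, of depth
# -epsilon, at r = 2**(1/6) * sigma, where the force vanishes.
r_min = 2.0 ** (1.0 / 6.0)
```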
![Page 6: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/6.jpg)
Reactive Bond Order (REBO) Potential
![Page 7: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/7.jpg)
Simple MD Algorithm
Execution time can be improved by using a cut-off radius, neighbor lists, the link cell method, or a combination of these.
A cut-off radius introduces discontinuities; these can be overcome by smoothing the potential function.
Velocity-Verlet integration.
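A minimal sketch of velocity-Verlet integration follows. It assumes a single mass for all atoms and a user-supplied `force` callable that maps a list of coordinates to a list of forces; both are simplifications chosen for this example.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Advance coordinates x and velocities v by n_steps of size dt."""
    f = force(x)
    for _ in range(n_steps):
        # Half-step velocity update with the current forces.
        v_half = [vi + 0.5 * dt * fi / mass for vi, fi in zip(v, f)]
        # Full-step position update.
        x = [xi + dt * vh for xi, vh in zip(x, v_half)]
        # Recompute forces at the new positions, then finish the velocity update.
        f = force(x)
        v = [vh + 0.5 * dt * fi / mass for vh, fi in zip(v_half, f)]
    return x, v


def harmonic_force(x):
    """Toy force for a unit spring constant, used only to exercise the integrator."""
    return [-xi for xi in x]


x_end, v_end = velocity_verlet([1.0], [0.0], harmonic_force, 1.0, 0.01, 1000)
```

For the harmonic test case the total energy `0.5 * (x**2 + v**2)` should stay close to its initial value of 0.5, which is a quick sanity check on the time step.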
![Page 8: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/8.jpg)
Boundary Conditions
Minimum Image Convention
Periodic Boundary Conditions
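Both ideas can be sketched in a few lines, assuming a cubic box of side `box_length` (the box shape and the function names are assumptions made for this example).

```python
import math


def wrap_position(x, box_length):
    """Periodic boundary conditions: map a coordinate back into [0, box_length)."""
    return x % box_length


def minimum_image(dx, box_length):
    """Minimum image convention: shortest signed separation along one axis."""
    return dx - box_length * round(dx / box_length)


def minimum_image_distance(a, b, box_length):
    """Distance between points a and b using the nearest periodic image of b."""
    d = [minimum_image(ai - bi, box_length) for ai, bi in zip(a, b)]
    return math.sqrt(sum(c * c for c in d))
```

With a box of side 10, two atoms at x = 0.5 and x = 9.5 are treated as being 1.0 apart through the boundary rather than 9.0 apart across the box.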
![Page 9: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/9.jpg)
Neighbor Lists
![Page 10: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/10.jpg)
Link Cell Method
![Page 11: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/11.jpg)
Parallel MD – Atom Decomposition
Involves dividing up the N atoms into sets of N/P atoms and assigning each set to one of the P processors.
At every time-step, two global communication operations are required (one for updating positions and the other for updating forces).
Runs in time proportional to the square of the number of atoms N.
Gives very good efficiency but long running times.
Suitable when the system is dense.
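The partitioning can be sketched serially as below. Each loop iteration stands in for one processor, and the write into the shared `forces` list stands in for the global force exchange; a real implementation would use a message-passing library, which is not shown. The function names and the toy pair force are assumptions made for this example.

```python
def partition_atoms(n_atoms, n_procs):
    """Split atom indices into n_procs nearly equal contiguous sets."""
    base, extra = divmod(n_atoms, n_procs)
    sets, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        sets.append(range(start, start + size))
        start += size
    return sets


def forces_on_owned(owned, positions, pair_force):
    """All-pairs forces on the atoms one processor owns (needs all positions)."""
    n = len(positions)
    return {
        i: sum(pair_force(positions[i], positions[j]) for j in range(n) if j != i)
        for i in owned
    }


def atom_decomposition_step(positions, n_procs, pair_force):
    """Emulate one force evaluation with the atoms divided among n_procs."""
    forces = [0.0] * len(positions)
    for owned in partition_atoms(len(positions), n_procs):
        for i, f in forces_on_owned(owned, positions, pair_force).items():
            forces[i] = f  # stands in for the global force communication
    return forces


def toy_pair_force(xi, xj):
    """Hypothetical 1D pair force, used only to exercise the code."""
    return xi - xj
```

Because every processor loops over all N atoms for each atom it owns, the total work scales with the square of N regardless of how the atoms are divided.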
![Page 12: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/12.jpg)
Parallel MD – Spatial Decomposition
Involves dividing up the simulation box into domains and assigning domains to processors.
Communication is local.
Efficiency is worse, but the running time is lower.
Works better when the system is not very dense.
Load balancing can be performed by dynamically varying the volume of the domains.
When the system is not very dense, the running time is nearly linear.
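A minimal sketch of the domain assignment, assuming a cubic box divided into equal cubic domains (dynamic resizing of domains for load balancing is not shown, and the function names are assumptions made for this example):

```python
def domain_of(position, box_length, domains_per_side):
    """Return the (ix, iy, iz) index of the domain containing a position."""
    width = box_length / domains_per_side
    index = []
    for c in position:
        k = int((c % box_length) / width)
        # Guard against rounding that lands exactly on the upper box edge.
        index.append(min(k, domains_per_side - 1))
    return tuple(index)


def assign_domains(positions, box_length, domains_per_side):
    """Group atom indices by the domain, and hence the processor, that owns them."""
    domains = {}
    for i, pos in enumerate(positions):
        key = domain_of(pos, box_length, domains_per_side)
        domains.setdefault(key, []).append(i)
    return domains
```

With a short-range cut-off, each domain only needs positions from adjacent domains, which is why the communication stays local.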
![Page 13: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/13.jpg)
Efficiency of Atom and Spatial Decomposition
[Figure: efficiency plots for Atom Decomposition and Spatial Decomposition]
![Page 14: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/14.jpg)
MD-GRAPE 2 for MD simulations
Parallel pipelined special purpose hardware for computing non-bonded forces.
Bonded forces and time integrations are performed on the host machine.
Can compute forces using either the all-pairs method or the link-cell method.
If there are more than half a million atoms in the system, they must be split into batches of at most half a million before being sent to the MD-GRAPE 2.
Peak performance of 64 Gflops.
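The batching rule can be sketched as follows. Only the index arithmetic is shown; how a batch is actually handed to the board is not covered, and the names `MAX_BATCH` and `batches` are assumptions made for this example.

```python
MAX_BATCH = 500_000  # "half a million" atoms per batch, as stated above


def batches(n_atoms, max_batch=MAX_BATCH):
    """Split atom indices 0..n_atoms-1 into ranges of at most max_batch atoms."""
    return [
        range(start, min(start + max_batch, n_atoms))
        for start in range(0, n_atoms, max_batch)
    ]
```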
![Page 15: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/15.jpg)
MD-GRAPE 2 Calculations
The forces and potentials are computed using the following two equations.
The function G(x) is evaluated using a segmented fourth order polynomial interpolation.
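The idea of segmented fourth-order interpolation can be sketched as below. This is an illustration only: it assumes uniform segments and stores five sample values per segment, evaluated in Lagrange form, whereas the table layout actually used by the hardware is not described here. The function names are assumptions made for this example.

```python
def build_table(g, x_min, x_max, n_segments):
    """Sample g at 5 equally spaced nodes inside each of n_segments segments."""
    width = (x_max - x_min) / n_segments
    table = []
    for s in range(n_segments):
        left = x_min + s * width
        nodes = [left + k * width / 4.0 for k in range(5)]
        table.append((nodes, [g(x) for x in nodes]))
    return table


def evaluate(table, x_min, x_max, x):
    """Evaluate the degree-4 interpolating polynomial of the segment containing x."""
    n = len(table)
    width = (x_max - x_min) / n
    s = min(max(int((x - x_min) / width), 0), n - 1)
    nodes, values = table[s]
    total = 0.0
    for k in range(5):
        term = values[k]
        for m in range(5):
            if m != k:
                term *= (x - nodes[m]) / (nodes[k] - nodes[m])
        total += term
    return total


# Example: tabulate an inverse-power function similar to those in pair forces.
table = build_table(lambda x: x ** -7, 0.5, 4.0, 256)
approx = evaluate(table, 0.5, 4.0, 1.2345)
```

Narrower segments reduce the interpolation error, at the cost of a larger table.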
![Page 16: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/16.jpg)
MD-GRAPE 2 Architecture
![Page 17: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/17.jpg)
Link Cell Method on MD-GRAPE 2
![Page 18: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/18.jpg)
Number of Processors vs. Execution Time for the MD-GRAPE 2 link cell method and the domain decomposition method.
Relative Error in Computing Total Energy with MD-GRAPE 2
![Page 19: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/19.jpg)
Scheduling MD on a cluster and MD-GRAPE 2 simultaneously
The REBO is a three-body potential.
It comprises three two-body components, VR, VA and VvdW, and a three-body component, Bij.
The MD-GRAPE 2 is not capable of computing three-body potentials due to its architecture.
Its custom function evaluation table also does not allow conditional statements to be placed in the function, a feature that is required to evaluate VR and VA.
This leaves VvdW as the only component that can be computed on the MD-GRAPE 2; all the other components are computed on the cluster.
![Page 20: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/20.jpg)
Scheduling MD on a cluster and MD-GRAPE 2 simultaneously (contd.)
The motivation for doing so is that VvdW has a cut-off that is roughly twice that of the other components, which makes it expensive to evaluate on the cluster.
It can, however, be computed efficiently on the MD-GRAPE 2 while the other components are being evaluated on the cluster simultaneously.
To aid in communication between the cluster and the machine hosting the MD-GRAPE 2, we implement a server.
![Page 21: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/21.jpg)
Scheduling MD on a cluster and MD-GRAPE 2 simultaneously (contd.)
The server accepts a position vector and outputs a partial force vector and a partial potential vector.
They are called partial since they only contain the contributions due to the VvdW component.
These have to be added to the other contributions, which are computed on the cluster.
![Page 22: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/22.jpg)
Scheduling MD on a cluster and MD-GRAPE 2 simultaneously (contd.)
At every time step, before the parallel code begins its computations, it sends a copy of the position vector to the MD-GRAPE 2.
The cluster and the MD-GRAPE 2 then compute partial forces/potentials simultaneously.
When the MD-GRAPE 2 completes its computations, it returns the partial forces/potentials to the host, and the host sums them to give the actual forces and total potential energy.
The cut-off for the computations on the cluster is now 2.5 instead of the 5.5 that would be required if all the components of the REBO potential were computed on it.
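The per-time-step flow can be sketched as follows. A background thread stands in for the MD-GRAPE 2 server, and the two `toy_*` functions are hypothetical placeholders for the real cluster and VvdW computations; none of these names come from the original implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def combined_forces(positions, cluster_partial, grape_partial):
    """Overlap the two partial computations, then sum their results."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Send a copy of the positions to the accelerator side first.
        pending = pool.submit(grape_partial, list(positions))
        # Meanwhile, evaluate the shorter-range components locally.
        f_cluster, u_cluster = cluster_partial(positions)
        # Wait for the partial forces and potential to come back.
        f_grape, u_grape = pending.result()
    forces = [a + b for a, b in zip(f_cluster, f_grape)]
    return forces, u_cluster + u_grape


def toy_cluster_partial(positions):
    """Hypothetical stand-in for the components evaluated on the cluster."""
    return [-x for x in positions], sum(0.5 * x * x for x in positions)


def toy_grape_partial(positions):
    """Hypothetical stand-in for the VvdW part returned by the server."""
    return [0.1 for _ in positions], -0.1 * sum(positions)


forces, potential = combined_forces([1.0, 2.0], toy_cluster_partial, toy_grape_partial)
```

The time step costs roughly the longer of the two partial computations plus the transfer, so the scheme pays off when the two sides take comparable time.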
![Page 23: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/23.jpg)
Scheduling MD on a cluster and MD-GRAPE 2 simultaneously (contd.)
The execution time of an MD simulation using the atom-decomposition method can be well approximated by a second degree polynomial tc(N).
The execution time on the MD-GRAPE 2 can also be approximated by a second degree polynomial tg(N).
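Such polynomials can be obtained from timing measurements by a least-squares fit. The sketch below solves the normal equations directly in plain Python; the function name and the sample data are hypothetical, and N is best expressed in scaled units (for example, thousands of atoms) to keep the system well conditioned.

```python
def fit_quadratic(ns, times):
    """Least-squares fit of times ~ a*n**2 + b*n + c; returns (a, b, c)."""
    power_sums = [sum(n ** k for n in ns) for k in range(5)]
    rhs = [sum(t * n ** k for n, t in zip(ns, times)) for k in range(3)]
    # Augmented normal equations with the unknowns ordered as (c, b, a).
    m = [[power_sums[k], power_sums[k + 1], power_sums[k + 2], rhs[k]]
         for k in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    c, b, a = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


# Hypothetical timings generated from t = 2*n**2 + 3*n + 1, for illustration only.
sizes = [1, 2, 3, 4, 5]
timings = [2 * n * n + 3 * n + 1 for n in sizes]
a, b, c = fit_quadratic(sizes, timings)
```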
The total time it takes to run such a simulation on the cluster and the MD-GRAPE 2 simultaneously is given by
![Page 24: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/24.jpg)
Scheduling MD on a cluster and MD-GRAPE 2 simultaneously (contd.)
The optimum number of processors to use can be determined by
Experimentally, we determined the optimal p to be 35 for our setup.
With this setup, we found that the speedup gradually approaches 1.4 and remains nearly constant after that.
We used the atom-decomposition method since the system being simulated is very dense.
![Page 25: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/25.jpg)
Plot of speedup when using a cluster and MD-GRAPE 2 simultaneously vs. using a cluster alone
![Page 26: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/26.jpg)
Conclusion
At the time of writing, the cost per processor, including network, was 1500 USD.
The cost of the MD-GRAPE 2 was 15000 USD.
For long range potentials it is more cost effective to use the MD-GRAPE 2, since it takes 61 cluster CPUs to equal its performance.
For short range potentials it is more effective to use a cluster, since it takes only 12 cluster CPUs to match its performance.
However, using a combination of a cluster and MD-GRAPE 2 to solve more complex potentials can yield a significant gain.
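The cost figures above can be compared with a few lines of arithmetic. The constant and function names are assumptions made for this example; the prices and CPU counts are the ones quoted above.

```python
COST_PER_CPU_USD = 1500    # per processor, including network
COST_MDGRAPE2_USD = 15000  # one MD-GRAPE 2 board


def cluster_cost(n_cpus):
    """Price of a cluster with n_cpus processors."""
    return n_cpus * COST_PER_CPU_USD


long_range_equivalent = cluster_cost(61)   # 91500 USD
short_range_equivalent = cluster_cost(12)  # 18000 USD
```

For long range potentials, the equivalent cluster costs about six times as much as the board. For short range potentials the two prices are of the same order, and the cluster can additionally be programmed to evaluate any kind of potential.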
![Page 27: Sumanth J.V, David R. Swanson and Hong Jiang University of Nebraska-Lincoln](https://reader036.fdocuments.net/reader036/viewer/2022062322/5681449a550346895db1461b/html5/thumbnails/27.jpg)
Future Work
Incorporate multiple MD-GRAPE 2 boards into the current setup.
Schedule MD simulations on larger scale systems with Globus.
Custom FPGA solutions to solve more than just pair potentials.
Using GPUs to perform MD.
Hand optimizing energy calculations to use SSE/SSE2/SSE3 instructions for optimal performance.