M.S. Thesis Defense
Comparing The Performance Of Distributed Shared Memory And Message Passing Programs Using The Hyperion Java Virtual Machine On Clusters
Mathew Reno
Overview
For this thesis we wanted to evaluate the performance of the Hyperion distributed virtual machine, designed at UNH, when compared to a preexisting parallel computing API.
The results would indicate where Hyperion’s strengths and weaknesses were and possibly validate Hyperion as a high-performance computing alternative.
What Is A Cluster?
A cluster is a group of low-cost computers connected with an “off the shelf” network.
The cluster’s network is isolated from WAN data traffic, and the computers in the cluster are presented to the user as a single resource.
Why Use Clusters?
Clusters are cost effective when compared to traditional parallel systems.
Clusters can be grown as needed.
Software components are based on standards, allowing portable software to be designed for the cluster.
Cluster Computing
Cluster computing takes advantage of the cluster by distributing the computational workload among the nodes of the cluster, thereby reducing total computation time.
There are many programming models for distributing data throughout the cluster.
Distributed Shared Memory
Distributed Shared Memory (DSM) allows the user to view the whole cluster as one resource.
Memory is shared among the nodes: each node can access every other node’s memory as if it were its own.
Data coordination among nodes is generally hidden from the user.
Message Passing
Message Passing (MP) requires explicit messages to be employed to distribute data throughout the cluster.
The programmer must coordinate all data exchanges when designing the application through a language-level MP API.
Related: Treadmarks Vs. PVM
Treadmarks (Rice, 1995) implements a DSM model while PVM implements an MP model. The two approaches were compared with benchmarks.
On average, PVM was found to perform two times better than Treadmarks. Treadmarks suffered from the excessive messages required by the request/response communication model DSM employs.
Treadmarks was found to be more natural to program with, saving development time.
Hyperion
Hyperion is a distributed Java Virtual Machine (JVM) designed at UNH. The Java language provides parallelism through its threading model. Hyperion extends this model by distributing the threads among the nodes of the cluster.
Hyperion implements the DSM model via DSM-PM2, which allows for lightweight thread creation and data distribution.
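To make the model concrete, here is a minimal sketch (our illustration, not code from the thesis) of the plain java.lang.Thread style that Hyperion accepts unchanged; under Hyperion, such threads are simply placed on nodes of the cluster.

    // Illustrative only: ordinary Java threading. Each thread computes a slice
    // of a sum; Hyperion would run the same code with the threads spread over
    // the cluster's nodes.
    public class ThreadedSumSketch implements Runnable {
        static final int N_THREADS = 4;
        static final double[] partial = new double[N_THREADS];
        final int id;

        ThreadedSumSketch(int id) { this.id = id; }

        public void run() {
            for (int i = id; i < 1000000; i += N_THREADS)
                partial[id] += 1.0 / (i + 1);   // this thread's share of the work
        }

        public static void main(String[] args) throws InterruptedException {
            Thread[] ts = new Thread[N_THREADS];
            for (int i = 0; i < N_THREADS; i++)
                (ts[i] = new Thread(new ThreadedSumSketch(i))).start();
            for (Thread t : ts) t.join();       // join publishes each thread's writes
            double sum = 0;
            for (double p : partial) sum += p;
            System.out.println(sum);
        }
    }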
Hyperion, Continued
Hyperion has a fixed memory size that it shares with all threads executing across the cluster.
Hyperion uses page-based data distribution: if a thread accesses memory it does not have locally, a page fault occurs and the memory is transmitted from the owning node to the requesting node, a page at a time.
Hyperion, Continued
Hyperion translates Java bytecodes into native C code, and a native executable is generated by a native C compiler.
The belief is that native executables are optimized by the C compiler and will benefit the application by executing faster than interpreted code.
Hyperion’s Threads
Threads are created in a round-robin fashion among the nodes of the cluster.
Data is transmitted between threads via a request/response mechanism, which requires two messages: in order to respond to a request message, a response thread must be scheduled, and this thread handles the request by sending back the requested data in a response message.
mpiJava
mpiJava is a Java wrapper for the Message Passing Interface (MPI). The Java Native Interface (JNI) is used to translate between Java and native code.
We used MPICH for the native MPI implementation.
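For reference, a minimal mpiJava program might look like the following (a sketch against the mpiJava 1.2 API; the class name is ours).

    import mpi.MPI;

    // Each process prints its rank within MPI.COMM_WORLD; Init and Finalize
    // bracket all MPI activity and call into the native library through JNI.
    public class HelloMpiJava {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();
            System.out.println("process " + rank + " of " + size);
            MPI.Finalize();
        }
    }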
Clusters
The “Star” cluster (UNH) consists of 16 PIII 667MHz Linux PCs on a 100Mb Fast Ethernet network. TCP is the communication protocol.
The “Paraski” cluster (France) consists of 16 PIII 1GHz Linux PCs on a 2Gb Myrinet network. BIP (DSM) and GM (MPI) are the communication protocols.
Clusters, Continued
The implementation of MPICH on BIP was not stable in time for this thesis, so GM had to be used in place of BIP for MPICH. GM has not been ported to Hyperion, and a port would be unreasonable at this time.
BIP performs better than GM as the message size increases.
BIP vs. GM Latency (Paraski)
[Figure: roundtrip latency (μs, 0-600) vs. message size (KB, 0-8) for MPI (GM) and PM2 (BIP).]
DSM & MPI In Hyperion
For consistency, mpiJava was ported into Hyperion, so both DSM and MPI versions of the benchmarks could be compiled by Hyperion.
The executables produced by Hyperion are then executed by the respective native launchers (PM2 and MPICH).
Benchmarks
The Java Grande Forum (JGF) developed a suite of benchmarks to test Java implementations.
We used two of the JGF benchmark suites: multithreaded and javaMPI.
Benchmarks, Continued
Benchmarks used:
- Fourier coefficient analysis
- Lower/upper matrix factorization
- Successive over-relaxation
- IDEA encryption
- Sparse matrix multiplication
- Molecular dynamics simulation
- 3D Ray Tracer
- Monte Carlo simulation (only with MPI)
Benchmarks And Hyperion
The multithreaded JGF benchmarks had unacceptable performance when run “out of the box”. Each benchmark creates all of its data objects on the root node, causing all remote object access to go through this one node.
This kind of access creates a performance bottleneck on the root node, which has to service all the requests while computing its own part of the algorithm.
The solution was to modify the benchmarks to be cluster aware.
Hyperion Extensions
Hyperion makes up for Java’s limited thread data management by providing efficient reduce and broadcast mechanisms.
Hyperion also provides a cluster-aware implementation of arraycopy.
Hyperion Extension: Reduce
Reduce blocks all enrolled threads until each thread has the final result of the reduce.
This is done by neighbor threads exchanging their data for computation, then their neighbors doing the same, and so on, until every thread has the same answer.
This operation is faster than performing the calculation serially and scales well: it completes in O(log P) steps for P threads.
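The slides do not give Hyperion’s internal algorithm; one standard way to obtain O(log P) behavior is recursive doubling, sketched below in plain Java (our illustration; the pairing scheme and names are assumptions, and P is a power of two).

    import java.util.concurrent.CyclicBarrier;

    // Recursive-doubling all-reduce among P threads: at step s, thread i
    // exchanges partial sums with thread i ^ s, so after log2(P) steps every
    // thread holds the full sum. Barriers separate the read and write phases.
    public class AllReduceSketch {
        static final int P = 8;
        static final double[] value = new double[P];
        static final CyclicBarrier barrier = new CyclicBarrier(P);

        public static void main(String[] args) throws InterruptedException {
            Thread[] ts = new Thread[P];
            for (int i = 0; i < P; i++) {
                final int id = i;
                ts[i] = new Thread(() -> {
                    value[id] = id + 1.0;            // this thread's contribution
                    try {
                        for (int step = 1; step < P; step <<= 1) {
                            barrier.await();         // all writes of the previous round done
                            double partner = value[id ^ step];
                            barrier.await();         // all reads done before overwriting
                            value[id] += partner;
                        }
                    } catch (Exception e) { throw new RuntimeException(e); }
                });
                ts[i].start();
            }
            for (Thread t : ts) t.join();
            System.out.println(value[0]);            // 36.0: every thread has the same sum
        }
    }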
Hyperion Extension: Broadcast
The broadcast mechanism transmits the same data to all enrolled threads.
Like reduce, data is distributed to the threads in O(log P) steps, which scales better than serial distribution of the data.
Hyperion Extension: Arraycopy
The arraycopy method is part of the Java System class. The Hyperion version was extended to be cluster aware.
If data is copied across threads, this version sends all the data as one message instead of relying on the paging mechanism to access remote array data.
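An illustrative fragment (the variable names are ours): the call site is ordinary Java, and only Hyperion’s implementation behind it is cluster aware.

    // If 'local' and 'global' live on different nodes, Hyperion ships this copy
    // as a single message instead of faulting the pages in one at a time.
    int offset = myThreadId * chunkSize;   // this thread's slice of the shared array
    System.arraycopy(local, 0, global, offset, chunkSize);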
Benchmark Modifications
The multithreaded benchmarks had unacceptable performance, so they were modified to reduce remote object access and root-node bottlenecks.
Techniques such as arraycopy, broadcast, and reduce were employed to improve performance.
Experiment
Each benchmark was executed 50 times at each node size to provide a sample mean. Node sizes were 1, 2, 4, 8, and 16.
Confidence intervals (95% level) were used to determine which version, MPI or DSM, performed better.
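The slides do not spell out the interval computation; presumably it is the standard t-based interval for a sample mean (an assumption on our part):

$\bar{x} \pm t_{0.975,\,49}\,\dfrac{s}{\sqrt{50}}$

where $\bar{x}$ and $s$ are the mean and standard deviation of the 50 runs; non-overlapping MPI and DSM intervals indicate a significant difference.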
Results On The Star Cluster
[Figure: performance ratio (DSM/MPI, 0-4) vs. number of nodes (1, 2, 4, 8, 16) for Crypt, LUFact, Moldyn, Series, SOR, Sparse, and RayTracer.]
Results On The Paraski Cluster
[Figure: performance ratio (DSM/MPI, 0-1.8) vs. number of nodes (1, 2, 4, 8, 16) for the same seven benchmarks.]
Fourier Coefficient Analysis
Calculates the first 10,000 pairs of Fourier coefficients. Each node is responsible for calculating its portion of the coefficient array.
Each node sends its array portion back to the root node, which accumulates the final array.
Fourier: DSM Modifications
The original multithreaded version required all threads to update arrays located on the root node, flooding the root node with requests.
The modified version used arraycopy to distribute the local arrays back to the root thread’s arrays.
Fourier: mpiJava
The mpiJava version is similar to the DSM version: each process is responsible for its portion of the arrays.
MPI_Ssend and MPI_Recv were called to distribute the array portions to the root process, as sketched below.
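The gather described above might look like this in mpiJava (a sketch under our assumptions; the real benchmark’s array sizes, tags, and names differ).

    import mpi.MPI;

    // Each rank fills its slice of the coefficient array, then rank 0 collects
    // the remote slices in place with matching Recv calls.
    public class FourierGatherSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();
            double[] coeffs = new double[10000];
            int chunk = coeffs.length / size;        // assume an even division

            for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
                coeffs[i] = i;                       // stand-in for the real computation

            if (rank != 0) {
                MPI.COMM_WORLD.Ssend(coeffs, rank * chunk, chunk, MPI.DOUBLE, 0, 0);
            } else {
                for (int src = 1; src < size; src++)
                    MPI.COMM_WORLD.Recv(coeffs, src * chunk, chunk, MPI.DOUBLE, src, 0);
            }
            MPI.Finalize();
        }
    }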
Fourier: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
Fourier: Conclusions
Most of the time in this benchmark is spent in the computation; network communication does not play a significant role in the overall time.
MPI and DSM perform similarly on each cluster, scaling well as more nodes are added.
Lower/Upper Factorization
Solves a 500 x 500 linear system with LU factorization followed by a triangular solve.
The factorization is parallelized while the triangular solve is computed serially.
LU: DSM Modifications
The original version created the matrix on the root thread and all access was through this thread, causing performance bottlenecks.
The benchmark was modified to use Hyperion’s broadcast facility to distribute the pivot information, and arraycopy was used to collect the final data for the solve.
LU: mpiJava
MPI_Bcast is used to distribute the pivot information. MPI_Send and MPI_Recv are used so the root process can acquire the final matrix.
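The pivot broadcast might look like this in mpiJava (a sketch; the owner rank and row length are our assumptions).

    import mpi.MPI;

    // The root rank fills the pivot row; Bcast leaves a copy on every rank.
    public class PivotBcastSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            double[] pivotRow = new double[500];
            int root = 0;                            // hypothetical owner of the pivot row
            if (MPI.COMM_WORLD.Rank() == root)
                for (int j = 0; j < pivotRow.length; j++)
                    pivotRow[j] = j;                 // stand-in data
            MPI.COMM_WORLD.Bcast(pivotRow, 0, pivotRow.length, MPI.DOUBLE, root);
            MPI.Finalize();
        }
    }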
LU: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
LU: Conclusions
While the DSM version uses a data distribution mechanism similar to the MPI version’s, significant overhead is exposed when these methods execute inside large loops.
This overhead is minimized on the Paraski cluster due to the nature of Myrinet and BIP.
Successive Over-Relaxation
Performs 100 iterations of SOR on a 1000 x 1000 grid. A “red-black” ordering mechanism allows array rows to be distributed to nodes in blocks.
After the initial data distribution, only neighbor rows need to be communicated during the SOR.
SOR: DSM Modifications
Excessive remote thread object access made it necessary to modify the benchmark.
The modified version uses arraycopy to update neighbor rows during the SOR.
When the SOR completes, arraycopy is used to assemble the final matrix on the root thread.
SOR: mpiJava
MPI_Sendrecv is used to exchange neighbor rows, as sketched below. MPI_Ssend and MPI_Recv are used to build the final matrix on the root process.
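The neighbor-row exchange might look like this in mpiJava (a sketch; the ghost-row layout, tags, and even division of rows are our assumptions).

    import mpi.MPI;

    // Each rank owns a block of rows plus one ghost row on each side, stored
    // flattened row-major; Sendrecv pairs the up/down transfers without deadlock.
    public class SorExchangeSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();
            int cols = 1000, localRows = 1000 / size;   // assume an even division
            double[] grid = new double[(localRows + 2) * cols];
            int up = (rank > 0) ? rank - 1 : MPI.PROC_NULL;
            int down = (rank < size - 1) ? rank + 1 : MPI.PROC_NULL;

            // send the first real row up, receive the lower ghost row from below
            MPI.COMM_WORLD.Sendrecv(grid, cols, cols, MPI.DOUBLE, up, 0,
                                    grid, (localRows + 1) * cols, cols, MPI.DOUBLE, down, 0);
            // send the last real row down, receive the upper ghost row from above
            MPI.COMM_WORLD.Sendrecv(grid, localRows * cols, cols, MPI.DOUBLE, down, 1,
                                    grid, 0, cols, MPI.DOUBLE, up, 1);
            MPI.Finalize();
        }
    }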
SOR: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
SOR: Conclusions
The DSM version requires an extra barrier after neighbor rows are exchanged, due to the “network reactivity” problem: a thread must be able to service all requests in a timely fashion, but a thread that is busy computing cannot react quickly enough to schedule the thread that services the request.
The barrier blocks all threads until each reaches it, which guarantees that all nodes have their requested data and that it is safe to continue with the computation.
IDEA Crypt
Performs IDEA encryption and decryption on a 3,000,000-byte array. The array is divided among the nodes in a block manner.
Each node encrypts and decrypts its portion. When complete, the root node collects the decrypted array for validation.
Crypt: DSM Modifications
The original created the whole array on the root thread and required each remote thread to page in its portion.
The modified version used arraycopy to distribute each thread’s portion from the root thread.
When decryption finishes, arraycopy copies the decrypted portion back to the root thread.
Crypt: mpiJava
The mpiJava version uses MPI_Ssend to send the array portions to the remote processes and MPI_Recv to receive them.
When complete, MPI_Ssend is used to send each process’s portion back and MPI_Recv receives each portion.
Crypt: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
Crypt: Conclusions
Results are similar on both clusters. There is a slight performance problem with 4 and 8 nodes in the DSM version.
This can be attributed to a barrier in the DSM version that causes all threads to block before computing, while the MPI version does not block.
Sparse Matrix Multiplication
A 50,000 x 50,000 unstructured matrix stored in compressed-row format is multiplied over 200 iterations.
Only the final result is communicated, as each node has its own portion of the data and the initial distribution is not timed.
Sparse: DSM Modifications
This benchmark originally produced excessive network traffic through remote object access.
The modifications involved removing the remote object access from the multiplication loop and using arraycopy to distribute the final result to the root thread.
Sparse: mpiJava
This benchmark’s only communication is an MPI_Allreduce, which performs an array reduce that leaves the result on all nodes. This is employed to obtain the final result of the multiplication, as sketched below.
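That single collective might look like this in mpiJava (a sketch; the array sizes are our assumptions).

    import mpi.MPI;

    // Allreduce sums each rank's partial result vector and leaves the complete
    // result on every rank.
    public class SparseAllreduceSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            double[] partial = new double[50000];    // this rank's contribution
            double[] result = new double[50000];     // identical on every rank afterwards
            // ... fill 'partial' from the local portion of the multiplication ...
            MPI.COMM_WORLD.Allreduce(partial, 0, result, 0, partial.length,
                                     MPI.DOUBLE, MPI.SUM);
            MPI.Finalize();
        }
    }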
Sparse: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
Sparse: Conclusions
The results are similar on both clusters, with the DSM version performing better on both.
The MPI version uses the MPI_Allreduce method, which places the results on all nodes.
This adds extra overhead that is not present in the DSM version, where the results are simply collected on the root node.
Molecular Dynamics
This benchmark is a 2048-body code that models particles interacting under a Lennard-Jones potential in a cubic spatial volume with periodic boundary conditions.
Parallelization is provided by dividing the range of the iterations over the particles among the nodes.
MolDyn: DSM Modifications
A significant amount of remote thread object access necessitated modifications.
Each remote thread uses arraycopy to send its particle forces to the root thread; the root thread updates the forces serially and sends the new force array back to the remote threads, again via arraycopy.
MolDyn: mpiJava
This version uses six MPI_Allreduce calls to update various force and movement arrays. This occurs at every time step.
MolDyn: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
MolDyn: Conclusions
The DSM version must update the particle forces serially on the root thread. This causes every thread to block while sending its local forces to the root thread and waiting for the updated forces to be sent back.
The MPI version uses MPI_Allreduce to compute the forces across all the processes in parallel, so each node blocks only for its neighbors’ force updates.
Ray Tracer
A scene with 64 spheres is rendered at 150 x 150 pixels.
Parallelization is provided by a cyclic distribution of the rows of pixels, which load-balances the loop over the rows (see the sketch below).
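A minimal sketch of the cyclic distribution (our illustration, not the benchmark’s code): interleaving rows balances the load because adjacent rows cost about the same to render.

    // Thread 'id' of N_THREADS renders rows id, id + N_THREADS, id + 2*N_THREADS, ...
    public class CyclicRowsSketch implements Runnable {
        static final int N_THREADS = 4, HEIGHT = 150, WIDTH = 150;
        final int id;
        CyclicRowsSketch(int id) { this.id = id; }

        public void run() {
            for (int row = id; row < HEIGHT; row += N_THREADS)
                for (int col = 0; col < WIDTH; col++) {
                    // render(row, col) would trace this pixel's ray
                }
        }

        public static void main(String[] args) throws InterruptedException {
            Thread[] ts = new Thread[N_THREADS];
            for (int i = 0; i < N_THREADS; i++)
                (ts[i] = new Thread(new CyclicRowsSketch(i))).start();
            for (Thread t : ts) t.join();
        }
    }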
Ray Tracer: DSM Modifications
This benchmark was poorly designed in that it created far too many temporary objects; the DSM version was heavily modified to eliminate temporary object creation.
Broadcast is used to transmit the array row reference to each thread.
Once the rendering is complete, arraycopy is used to copy the row data to the root thread.
Reduce is used to compute the pixel checksum for validation purposes.
Ray Tracer: mpiJava
The mpiJava version was also modified to remove temporary object creation.
MPI_Reduce is used to generate the pixel checksum.
MPI_Send and MPI_Recv are used to transmit the row data to the root process.
Ray Tracer: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
Ray Tracer: Conclusions
The results on both clusters are almost identical. Very little network communication is required; most of the time is spent rendering the scene.
Monte Carlo
Uses Monte Carlo techniques to price products derived from the worth of an underlying asset. Generates 2,000 samples.
Parallelization is provided by dividing the work in the principal loop of the Monte Carlo runs in block fashion and distributing the blocks to the remote nodes.
Monte Carlo: Lack Of DSM
This benchmark required an unacceptable amount of memory for Hyperion to handle.
It is embarrassingly parallel, and we have other embarrassingly parallel benchmarks, so we felt it was unnecessary to develop a working DSM version from this large body of code.
Monte Carlo: mpiJava
The mpiJava version provided some insight into the mpiJava-to-Hyperion port, specifically in the object serialization portion.
Monte Carlo relies heavily on object serialization, as Java objects, rather than primitive types, are distributed via send and receive commands.
By creating a working version of Monte Carlo, we were able to eliminate many subtle bugs in the mpiJava-to-Hyperion port.
Monte Carlo: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI only; one panel for Star, one for Paraski.]
Monte Carlo: Conclusions
The MPI version scales well on both clusters. Without a DSM version, a comparison is not possible.
Conclusions
The Hyperion system can compete with traditional parallel programming models.
However, to compete, a Hyperion user cannot simply write multithreaded Java code and expect it to perform well on a cluster.
Users must be aware of how thread creation works in Hyperion and of the effects of remote object access in order to tune their applications.
Conclusions, Continued
With an application developed using Hyperion’s thread management and data exchange techniques (reduce, broadcast, arraycopy), a Hyperion user can achieve competitive performance.
We feel that methods for performing operations on groups of threads, such as those above, should be part of the Java threading API, as they could be useful even outside a parallel environment.