M.S. Thesis Defense
Comparing The Performance Of Distributed Shared Memory And Message Passing Programs Using The Hyperion Java Virtual Machine On Clusters
Mathew Reno
Overview
For this thesis we wanted to evaluate the performance of the Hyperion distributed virtual machine, designed at UNH, when compared to a preexisting parallel computing API.
The results would indicate where Hyperion’s strengths and weaknesses were and possibly validate Hyperion as a high-performance computing alternative.
What Is A Cluster?
A cluster is a group of low-cost computers connected with an “off the shelf” network.
The cluster’s network is isolated from WAN data traffic, and the computers in the cluster are presented to the user as a single resource.
Why Use Clusters?
Clusters are cost effective when compared to traditional parallel systems.
Clusters can be grown as needed.
Software components are based on standards, allowing portable software to be designed for the cluster.
Cluster Computing
Cluster computing takes advantage of the cluster by distributing the computational workload among the nodes of the cluster, thereby reducing total computation time.
There are many programming models for distributing data throughout the cluster.
Distributed Shared Memory
Distributed Shared Memory (DSM) allows the user to view the whole cluster as one resource.
Memory is shared among the nodes: each node can access every other node’s memory as if it were its own.
Data coordination among nodes is generally hidden from the user.
Message Passing
Message Passing (MP) requires explicit messages to be employed to distribute data throughout the cluster.
The programmer must coordinate all data exchanges when designing the application through a language-level MP API.
Related: Treadmarks Vs. PVM
Treadmarks (Rice, 1995) implements a DSM model while PVM implements an MP model. The two approaches were compared with benchmarks.
On average, PVM was found to perform two times better than Treadmarks. Treadmarks suffered from the excessive messages required by the request/response communication model DSM employs.
Treadmarks was found to be more natural to program with, saving development time.
Hyperion
Hyperion is a distributed Java Virtual Machine (JVM) designed at UNH. The Java language provides parallelism through its threading model. Hyperion extends this model by distributing the threads among the nodes of the cluster.
Hyperion implements the DSM model via DSM-PM2, which allows for lightweight thread creation and data distribution.
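To make the model concrete, here is a minimal sketch (our illustration, not code from the thesis) of the plain java.lang.Thread style that Hyperion accepts unchanged; under Hyperion, such threads are simply placed on nodes of the cluster.

    // Illustrative only: ordinary Java threading. Each thread computes a slice
    // of a sum; Hyperion would run the same code with the threads spread over
    // the cluster's nodes.
    public class ThreadedSumSketch implements Runnable {
        static final int N_THREADS = 4;
        static final double[] partial = new double[N_THREADS];
        final int id;

        ThreadedSumSketch(int id) { this.id = id; }

        public void run() {
            for (int i = id; i < 1000000; i += N_THREADS)
                partial[id] += 1.0 / (i + 1);   // this thread's share of the work
        }

        public static void main(String[] args) throws InterruptedException {
            Thread[] ts = new Thread[N_THREADS];
            for (int i = 0; i < N_THREADS; i++)
                (ts[i] = new Thread(new ThreadedSumSketch(i))).start();
            for (Thread t : ts) t.join();       // join publishes each thread's writes
            double sum = 0;
            for (double p : partial) sum += p;
            System.out.println(sum);
        }
    }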
Hyperion, Continued
Hyperion has a fixed memory size that it shares with all threads executing across the cluster.
Hyperion uses page-based data distribution: if a thread accesses memory it does not have locally, a page fault occurs and the memory is transmitted from the owning node to the requesting node, a page at a time.
Hyperion, Continued
Hyperion translates Java bytecodes into native C code, and a native executable is generated by a native C compiler.
The belief is that native executables are optimized by the C compiler and will benefit the application by executing faster than interpreted code.
Hyperion’s Threads
Threads are created in a round-robin fashion among the nodes of the cluster.
Data is transmitted between threads via a request/response mechanism, which requires two messages: in order to respond to a request message, a response thread must be scheduled, and this thread handles the request by sending back the requested data in a response message.
mpiJava
mpiJava is a Java wrapper for the Message Passing Interface (MPI). The Java Native Interface (JNI) is used to translate between Java and native code.
We used MPICH for the native MPI implementation.
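For reference, a minimal mpiJava program might look like the following (a sketch against the mpiJava 1.2 API; the class name is ours).

    import mpi.MPI;

    // Each process prints its rank within MPI.COMM_WORLD; Init and Finalize
    // bracket all MPI activity and call into the native library through JNI.
    public class HelloMpiJava {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();
            System.out.println("process " + rank + " of " + size);
            MPI.Finalize();
        }
    }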
Clusters
The “Star” cluster (UNH) consists of 16 PIII 667MHz Linux PCs on a 100Mb Fast Ethernet network. TCP is the communication protocol.
The “Paraski” cluster (France) consists of 16 PIII 1GHz Linux PCs on a 2Gb Myrinet network. BIP (DSM) and GM (MPI) are the communication protocols.
Clusters, Continued
The implementation of MPICH on BIP was not stable in time for this thesis, so GM had to be used in place of BIP for MPICH. GM has not been ported to Hyperion, and a port would be unreasonable at this time.
BIP performs better than GM as the message size increases.
BIP vs. GM Latency (Paraski)
[Figure: roundtrip latency (μs, 0-600) vs. message size (KB, 0-8) for MPI (GM) and PM2 (BIP).]
DSM & MPI In Hyperion
For consistency, mpiJava was ported into Hyperion, so both DSM and MPI versions of the benchmarks could be compiled by Hyperion.
The executables produced by Hyperion are then executed by the respective native launchers (PM2 and MPICH).
Benchmarks
The Java Grande Forum (JGF) developed a suite of benchmarks to test Java implementations.
We used two of the JGF benchmark suites: multithreaded and javaMPI.
Benchmarks, Continued
Benchmarks used:
- Fourier coefficient analysis
- Lower/upper matrix factorization
- Successive over-relaxation
- IDEA encryption
- Sparse matrix multiplication
- Molecular dynamics simulation
- 3D Ray Tracer
- Monte Carlo simulation (only with MPI)
Benchmarks And Hyperion
The multithreaded JGF benchmarks had unacceptable performance when run “out of the box”. Each benchmark creates all of its data objects on the root node, causing all remote object access to go through this one node.
This kind of access creates a performance bottleneck on the root node, which has to service all the requests while computing its own part of the algorithm.
The solution was to modify the benchmarks to be cluster aware.
Hyperion Extensions
Hyperion makes up for Java’s limited thread data management by providing efficient reduce and broadcast mechanisms.
Hyperion also provides a cluster-aware implementation of arraycopy.
Hyperion Extension: Reduce
Reduce blocks all enrolled threads until each thread has the final result of the reduce.
This is done by neighbor threads exchanging their data for computation, then their neighbors doing the same, and so on, until every thread has the same answer.
This operation is faster than performing the calculation serially and scales well: it completes in O(log P) steps for P threads.
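The slides do not give Hyperion’s internal algorithm; one standard way to obtain O(log P) behavior is recursive doubling, sketched below in plain Java (our illustration; the pairing scheme and names are assumptions, and P is a power of two).

    import java.util.concurrent.CyclicBarrier;

    // Recursive-doubling all-reduce among P threads: at step s, thread i
    // exchanges partial sums with thread i ^ s, so after log2(P) steps every
    // thread holds the full sum. Barriers separate the read and write phases.
    public class AllReduceSketch {
        static final int P = 8;
        static final double[] value = new double[P];
        static final CyclicBarrier barrier = new CyclicBarrier(P);

        public static void main(String[] args) throws InterruptedException {
            Thread[] ts = new Thread[P];
            for (int i = 0; i < P; i++) {
                final int id = i;
                ts[i] = new Thread(() -> {
                    value[id] = id + 1.0;            // this thread's contribution
                    try {
                        for (int step = 1; step < P; step <<= 1) {
                            barrier.await();         // all writes of the previous round done
                            double partner = value[id ^ step];
                            barrier.await();         // all reads done before overwriting
                            value[id] += partner;
                        }
                    } catch (Exception e) { throw new RuntimeException(e); }
                });
                ts[i].start();
            }
            for (Thread t : ts) t.join();
            System.out.println(value[0]);            // 36.0: every thread has the same sum
        }
    }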
Hyperion Extension: Broadcast
The broadcast mechanism transmits the same data to all enrolled threads.
Like reduce, data is distributed to the threads in O(log P) steps, which scales better than serial distribution of the data.
Hyperion Extension: Arraycopy
The arraycopy method is part of the Java System class. The Hyperion version was extended to be cluster aware.
If data is copied across threads, this version sends all the data as one message instead of relying on the paging mechanism to access remote array data.
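An illustrative fragment (the variable names are ours): the call site is ordinary Java, and only Hyperion’s implementation behind it is cluster aware.

    // If 'local' and 'global' live on different nodes, Hyperion ships this copy
    // as a single message instead of faulting the pages in one at a time.
    int offset = myThreadId * chunkSize;   // this thread's slice of the shared array
    System.arraycopy(local, 0, global, offset, chunkSize);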
Benchmark Modifications
The multithreaded benchmarks had unacceptable performance, so they were modified to reduce remote object access and root-node bottlenecks.
Techniques such as arraycopy, broadcast, and reduce were employed to improve performance.
Experiment
Each benchmark was executed 50 times at each node size to provide a sample mean. Node sizes were 1, 2, 4, 8, and 16.
Confidence intervals (95% level) were used to determine which version, MPI or DSM, performed better.
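The slides do not spell out the interval computation; presumably it is the standard t-based interval for a sample mean (an assumption on our part):

$\bar{x} \pm t_{0.975,\,49}\,\dfrac{s}{\sqrt{50}}$

where $\bar{x}$ and $s$ are the mean and standard deviation of the 50 runs; non-overlapping MPI and DSM intervals indicate a significant difference.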
Results On The Star Cluster
[Figure: performance ratio (DSM/MPI, 0-4) vs. number of nodes (1, 2, 4, 8, 16) for Crypt, LUFact, Moldyn, Series, SOR, Sparse, and RayTracer.]
Results On The Paraski Cluster
[Figure: performance ratio (DSM/MPI, 0-1.8) vs. number of nodes (1, 2, 4, 8, 16) for the same seven benchmarks.]
Fourier Coefficient Analysis
Calculates the first 10,000 pairs of Fourier coefficients. Each node is responsible for calculating its portion of the coefficient array.
Each node sends its array portion back to the root node, which accumulates the final array.
Fourier: DSM Modifications
The original multithreaded version required all threads to update arrays located on the root node, flooding the root node with requests.
The modified version used arraycopy to distribute the local arrays back to the root thread’s arrays.
Fourier: mpiJava
The mpiJava version is similar to the DSM version: each process is responsible for its portion of the arrays.
MPI_Ssend and MPI_Recv were called to distribute the array portions to the root process, as sketched below.
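The gather described above might look like this in mpiJava (a sketch under our assumptions; the real benchmark’s array sizes, tags, and names differ).

    import mpi.MPI;

    // Each rank fills its slice of the coefficient array, then rank 0 collects
    // the remote slices in place with matching Recv calls.
    public class FourierGatherSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();
            double[] coeffs = new double[10000];
            int chunk = coeffs.length / size;        // assume an even division

            for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
                coeffs[i] = i;                       // stand-in for the real computation

            if (rank != 0) {
                MPI.COMM_WORLD.Ssend(coeffs, rank * chunk, chunk, MPI.DOUBLE, 0, 0);
            } else {
                for (int src = 1; src < size; src++)
                    MPI.COMM_WORLD.Recv(coeffs, src * chunk, chunk, MPI.DOUBLE, src, 0);
            }
            MPI.Finalize();
        }
    }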
Fourier: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
Fourier: Conclusions
Most of the time in this benchmark is spent in the computation; network communication does not play a significant role in the overall time.
MPI and DSM perform similarly on each cluster, scaling well as more nodes are added.
Lower/Upper Factorization
Solves a 500 x 500 linear system with LU factorization followed by a triangular solve.
The factorization is parallelized while the triangular solve is computed serially.
LU: DSM Modifications
The original version created the matrix on the root thread and all access was through this thread, causing performance bottlenecks.
The benchmark was modified to use Hyperion’s broadcast facility to distribute the pivot information, and arraycopy was used to collect the final data for the solve.
LU: mpiJava
MPI_Bcast is used to distribute the pivot information. MPI_Send and MPI_Recv are used so the root process can acquire the final matrix.
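The pivot broadcast might look like this in mpiJava (a sketch; the owner rank and row length are our assumptions).

    import mpi.MPI;

    // The root rank fills the pivot row; Bcast leaves a copy on every rank.
    public class PivotBcastSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            double[] pivotRow = new double[500];
            int root = 0;                            // hypothetical owner of the pivot row
            if (MPI.COMM_WORLD.Rank() == root)
                for (int j = 0; j < pivotRow.length; j++)
                    pivotRow[j] = j;                 // stand-in data
            MPI.COMM_WORLD.Bcast(pivotRow, 0, pivotRow.length, MPI.DOUBLE, root);
            MPI.Finalize();
        }
    }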
LU: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
LU: Conclusions
While the DSM version uses a data distribution mechanism similar to the MPI version’s, significant overhead is exposed when these methods execute inside large loops.
This overhead is minimized on the Paraski cluster due to the nature of Myrinet and BIP.
Successive Over-Relaxation
Performs 100 iterations of SOR on a 1000 x 1000 grid. A “red-black” ordering mechanism allows array rows to be distributed to nodes in blocks.
After the initial data distribution, only neighbor rows need to be communicated during the SOR.
SOR: DSM Modifications
Excessive remote thread object access made it necessary to modify the benchmark.
The modified version uses arraycopy to update neighbor rows during the SOR.
When the SOR completes, arraycopy is used to assemble the final matrix on the root thread.
SOR: mpiJava
MPI_Sendrecv is used to exchange neighbor rows, as sketched below. MPI_Ssend and MPI_Recv are used to build the final matrix on the root process.
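The neighbor-row exchange might look like this in mpiJava (a sketch; the ghost-row layout, tags, and even division of rows are our assumptions).

    import mpi.MPI;

    // Each rank owns a block of rows plus one ghost row on each side, stored
    // flattened row-major; Sendrecv pairs the up/down transfers without deadlock.
    public class SorExchangeSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();
            int cols = 1000, localRows = 1000 / size;   // assume an even division
            double[] grid = new double[(localRows + 2) * cols];
            int up = (rank > 0) ? rank - 1 : MPI.PROC_NULL;
            int down = (rank < size - 1) ? rank + 1 : MPI.PROC_NULL;

            // send the first real row up, receive the lower ghost row from below
            MPI.COMM_WORLD.Sendrecv(grid, cols, cols, MPI.DOUBLE, up, 0,
                                    grid, (localRows + 1) * cols, cols, MPI.DOUBLE, down, 0);
            // send the last real row down, receive the upper ghost row from above
            MPI.COMM_WORLD.Sendrecv(grid, localRows * cols, cols, MPI.DOUBLE, down, 1,
                                    grid, 0, cols, MPI.DOUBLE, up, 1);
            MPI.Finalize();
        }
    }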
SOR: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
SOR: Conclusions
The DSM version requires an extra barrier after neighbor rows are exchanged, due to the “network reactivity” problem: a thread must be able to service all requests in a timely fashion, but a thread that is busy computing cannot react quickly enough to schedule the thread that services the request.
The barrier blocks all threads until each reaches it, which guarantees that all nodes have their requested data and that it is safe to continue with the computation.
IDEA Crypt
Performs IDEA encryption and decryption on a 3,000,000-byte array. The array is divided among the nodes in a block manner.
Each node encrypts and decrypts its portion. When complete, the root node collects the decrypted array for validation.
Crypt: DSM Modifications
The original created the whole array on the root thread and required each remote thread to page in its portion.
The modified version used arraycopy to distribute each thread’s portion from the root thread.
When decryption finishes, arraycopy copies the decrypted portion back to the root thread.
Crypt: mpiJava
The mpiJava version uses MPI_Ssend to send the array portions to the remote processes and MPI_Recv to receive them.
When complete, MPI_Ssend is used to send each process’s portion back and MPI_Recv receives each portion.
Crypt: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
Crypt: Conclusions
Results are similar on both clusters. There is a slight performance problem with 4 and 8 nodes in the DSM version.
This can be attributed to a barrier in the DSM version that causes all threads to block before computing, while the MPI version does not block.
Sparse Matrix Multiplication
A 50,000 x 50,000 unstructured matrix stored in compressed-row format is multiplied over 200 iterations.
Only the final result is communicated, as each node has its own portion of the data and the initial distribution is not timed.
Sparse: DSM Modifications
This benchmark originally produced excessive network traffic through remote object access.
The modifications involved removing the remote object access from the multiplication loop and using arraycopy to distribute the final result to the root thread.
Sparse: mpiJava
This benchmark’s only communication is an MPI_Allreduce, which performs an array reduce that leaves the result on all nodes. This is employed to obtain the final result of the multiplication, as sketched below.
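That single collective might look like this in mpiJava (a sketch; the array sizes are our assumptions).

    import mpi.MPI;

    // Allreduce sums each rank's partial result vector and leaves the complete
    // result on every rank.
    public class SparseAllreduceSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            double[] partial = new double[50000];    // this rank's contribution
            double[] result = new double[50000];     // identical on every rank afterwards
            // ... fill 'partial' from the local portion of the multiplication ...
            MPI.COMM_WORLD.Allreduce(partial, 0, result, 0, partial.length,
                                     MPI.DOUBLE, MPI.SUM);
            MPI.Finalize();
        }
    }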
Sparse: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
Sparse: Conclusions
The results are similar on both clusters, with the DSM version performing better on both.
The MPI version uses the MPI_Allreduce method, which places the results on all nodes.
This adds extra overhead that is not present in the DSM version, where the results are simply collected on the root node.
Molecular Dynamics
This benchmark is a 2048-body code that models particles interacting under a Lennard-Jones potential in a cubic spatial volume with periodic boundary conditions.
Parallelization is provided by dividing the range of the iterations over the particles among the nodes.
MolDyn: DSM Modifications
A significant amount of remote thread object access necessitated modifications.
Each remote thread uses arraycopy to send its particle forces to the root thread; the root thread updates the forces serially and sends the new force array back to the remote threads, again via arraycopy.
MolDyn: mpiJava
This version uses six MPI_Allreduce calls to update various force and movement arrays. This occurs at every time step.
MolDyn: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
MolDyn: Conclusions
The DSM version must update the particle forces serially on the root thread. This causes every thread to block while sending its local forces to the root thread and waiting for the updated forces to be sent back.
The MPI version uses MPI_Allreduce to compute the forces across all the processes in parallel, so each node blocks only for its neighbors’ force updates.
Ray Tracer
A scene with 64 spheres is rendered at 150 x 150 pixels.
Parallelization is provided by a cyclic distribution of the rows of pixels, which load-balances the loop over the rows (see the sketch below).
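A minimal sketch of the cyclic distribution (our illustration, not the benchmark’s code): interleaving rows balances the load because adjacent rows cost about the same to render.

    // Thread 'id' of N_THREADS renders rows id, id + N_THREADS, id + 2*N_THREADS, ...
    public class CyclicRowsSketch implements Runnable {
        static final int N_THREADS = 4, HEIGHT = 150, WIDTH = 150;
        final int id;
        CyclicRowsSketch(int id) { this.id = id; }

        public void run() {
            for (int row = id; row < HEIGHT; row += N_THREADS)
                for (int col = 0; col < WIDTH; col++) {
                    // render(row, col) would trace this pixel's ray
                }
        }

        public static void main(String[] args) throws InterruptedException {
            Thread[] ts = new Thread[N_THREADS];
            for (int i = 0; i < N_THREADS; i++)
                (ts[i] = new Thread(new CyclicRowsSketch(i))).start();
            for (Thread t : ts) t.join();
        }
    }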
Ray Tracer: DSM Modifications
This benchmark was poorly designed in that it created far too many temporary objects; the DSM version was heavily modified to eliminate temporary object creation.
Broadcast is used to transmit the array row reference to each thread.
Once the rendering is complete, arraycopy is used to copy the row data to the root thread.
Reduce is used to compute the pixel checksum for validation purposes.
Ray Tracer: mpiJava
The mpiJava version was also modified to remove temporary object creation.
MPI_Reduce is used to generate the pixel checksum.
MPI_Send and MPI_Recv are used to transmit the row data to the root process.
Ray Tracer: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI and DSM; one panel for Star, one for Paraski.]
Ray Tracer: Conclusions
The results on both clusters are almost identical. Very little network communication is required; most of the time is spent rendering the scene.
Monte Carlo
Uses Monte Carlo techniques to price products derived from the worth of an underlying asset. Generates 2,000 samples.
Parallelization is provided by dividing the work in the principal loop of the Monte Carlo runs in block fashion and distributing the blocks to the remote nodes.
Monte Carlo: Lack Of DSM
This benchmark required an unacceptable amount of memory for Hyperion to handle.
It is embarrassingly parallel, and we have other embarrassingly parallel benchmarks, so we felt it was unnecessary to develop a working DSM version from this large body of code.
Monte Carlo: mpiJava
The mpiJava version provided some insight into the mpiJava-to-Hyperion port, specifically in the object serialization portion.
Monte Carlo relies heavily on object serialization, as Java objects, rather than primitive types, are distributed via send and receive commands.
By creating a working version of Monte Carlo, we were able to eliminate many subtle bugs in the mpiJava-to-Hyperion port.
Monte Carlo: Results
[Figure: execution time (seconds) vs. number of nodes (1, 2, 4, 8, 16) for MPI only; one panel for Star, one for Paraski.]
Monte Carlo: Conclusions
The MPI version scales well on both clusters. Without a DSM version, a comparison is not possible.
Conclusions
The Hyperion system can compete with traditional parallel programming models.
However, to compete, a Hyperion user cannot simply write multithreaded Java code and expect it to perform well on a cluster.
Users must be aware of how thread creation works in Hyperion and of the effects of remote object access in order to tune their applications.
Conclusions, Continued
With an application developed using Hyperion’s thread management and data exchange techniques (reduce, broadcast, arraycopy), a Hyperion user can achieve competitive performance.
We feel that methods for performing operations on groups of threads, such as those above, should be part of the Java threading API, as they could be useful even outside a parallel environment.