Using Charm++ to Mask Latency in Grid Computing Applications
Gregory A. Koenig ([email protected])
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
2004 Charm++ Workshop
Problem: Latency Tolerance for Multi-Cluster Applications
Goal: Good performance for tightly-coupled applications running across multiple clusters in a single-campus Grid environment
Scenarios: very large applications; on-demand computing
Challenge: Masking the effects of latency on inter-cluster messages
[Diagram: Cluster A and Cluster B connected by a wide-area link; intra-cluster latency is on the order of microseconds, inter-cluster latency on the order of milliseconds]
Solution: Processor Virtualization
Charm++ chares and Adaptive MPI threads virtualize the notion of a processor.
The programmer decomposes a program into a large number of virtual processors.
The adaptive runtime system maps virtual processors onto physical processors; the runtime may adjust this mapping as the program executes (load balancing).
If one virtual processor mapped to a physical processor cannot make progress, another virtual processor on the same physical processor may be able to do useful work.
No modification of application software and no problem-specific tricks are necessary!
Hypothetical Timeline View of a Multi-Cluster Computation
[Timeline diagram: execution of Processors A, B, and C, with messages crossing the cluster boundary]
Processors A and B are on one cluster; Processor C is on a second cluster
Communication between clusters travels over a high-latency WAN
Processor virtualization allows the latency to be masked
Charm++ on the Virtual Machine Interface (VMI)
Message data are passed along the VMI "send chain" and "receive chain"
Devices on each chain may deliver data directly, manipulate data, and/or pass data to the next device
[Diagram of the software stack: Application; Charm++ and AMPI; Converse (machine layer); VMI with its send chain and receive chain]
Description of Experiments
Experimental environment:
Artificial latency environment: a VMI "delay device" adds a pre-defined latency between arbitrary pairs of nodes
TeraGrid environment: experiments run between NCSA and ANL machines (~1.725 ms one-way latency)
Experiments:
Five-point stencil (2D Jacobi) for matrix sizes 2048x2048 and 8192x8192
LeanMD molecular dynamics code running a 30,652-atom system
Five-Point Stencil Results (P=2)
Five-Point Stencil Results (P=16)
Five-Point Stencil Results (P=32)
Five-Point Stencil Results (P=64)
LeanMD Results
Conclusion
Processor virtualization is a useful technique for masking latency in grid computing environments.
Future Work
Testing across NCSA-SDSC
Leverage Charm++ prioritized messages
Grid-topology-aware load balancer
Processor speed normalization
Leverage Adaptive MPI