NetSolve

Henri Casanova and Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
http://www.cs.utk.edu/netsolve

Objectives

- Harnessing vast computational resources on the network
  - Hardware
  - Software
- Convenient for scientific computing community
  - Reducing installation and programming overhead
  - Masking complexity related to distributed computing

Computation-Sharing Models: Proxy Computing

[Diagram: Data and Code shown at both Client and Server]

Computation on the server

Computation-Sharing Models: Code Shipping

[Diagram: Data at Client; Code shown at both Client and Server]

Computation on the client

Computation-Sharing Models: Remote Computation

[Diagram: Code at Server; Data shown at both Client and Server]

Computation on the server

Design issues

- Platform independence to accommodate heterogeneity
- User friendly
- Extensibility
- Load balancing
- Fault tolerance

NetSolve Architecture

[Architecture diagram; surviving labels: “OS”, Resources]

NetSolve Organization and Operation

NetSolve Client Interface

C, Fortran, Java, Matlab, and Mathematica

Blocking call (Matlab):

    >> a = rand(100); b = rand(100,1);
    >> x = netsolve('ax = b', a, b);

Non-blocking call (Matlab):

    >> a = rand(100); b = rand(100,1);
    >> request = netsolve_nb('send', 'ax = b', a, b);
    >> x = netsolve_nb('probe', request);
    Not ready
    >> x = netsolve_nb('wait', request);
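The send / probe / wait pattern above is not tied to Matlab. As a rough illustration only, the Python sketch below mimics it with the standard `concurrent.futures` module. It does not call NetSolve: `solve_linear_system` is a hypothetical local stand-in for the remote 'ax = b' solver, and `send`, `probe`, and `wait` are illustrative names chosen to mirror the Matlab calls.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_linear_system(a, b):
    """Hypothetical stand-in for a remote solver: 2x2 system a x = b via Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [x0, x1]


def send(executor, a, b):
    """Submit the request and return a handle immediately ('send')."""
    return executor.submit(solve_linear_system, a, b)


def probe(request):
    """Report, without blocking, whether the result is ready ('probe')."""
    return request.done()


def wait(request):
    """Block until the result is ready and return it ('wait')."""
    return request.result()


executor = ThreadPoolExecutor(max_workers=1)
request = send(executor, [[2.0, 0.0], [0.0, 4.0]], [2.0, 8.0])
# The client is free to do other work here, calling probe(request) as needed.
x = wait(request)
executor.shutdown()
```

Submitting several requests before waiting on any of them is what lets a client overlap independent tasks.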

NetSolve Wrappers

Problem description file for extensibility

    @PROBLEM ipars
    @INCLUDE "ipars.h"
    @LIB /home/user/lib/libipars.a
    @DESCRIPTION
    Parallel Sub-Surface Flow Simulator
    @INPUT 2
    @OBJECT STRING CHAR model
    @OBJECT FILE CHAR infile

- Compiled into wrappers around scientific libraries
- XDR for platform-independent data transfer
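To make the XDR point concrete, here is a minimal sketch of XDR-style encoding using Python's standard `struct` module. It assumes only the general XDR rules (big-endian byte order, a 4-byte count before a variable-length array, 8-byte IEEE 754 doubles). It is not NetSolve's actual wire format, and the function names are hypothetical.

```python
import struct


def xdr_pack_double_array(values):
    """Encode a variable-length array: 4-byte big-endian count, then 8-byte doubles."""
    header = struct.pack(">I", len(values))
    body = struct.pack(f">{len(values)}d", *values)
    return header + body


def xdr_unpack_double_array(data):
    """Decode the bytes produced by xdr_pack_double_array back into a list of floats."""
    (count,) = struct.unpack_from(">I", data, 0)
    return list(struct.unpack_from(f">{count}d", data, 4))


# The same bytes are produced on any host, regardless of its native byte order.
payload = xdr_pack_double_array([1.0, -2.5, 3.25])
decoded = xdr_unpack_double_array(payload)
```

Because the byte order is fixed by the format rather than by the host, a client and a server on different architectures can exchange the payload without negotiating a representation.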

NetSolve Load Balancing

- Assigning a task to the “best” machine
  - Establishing a performance model
    - Network delay, server properties, task properties
  - Measuring and monitoring dynamic system states
- Load balancing at a finer granularity
  - Parallelism through non-blocking interface
  - Task migration
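One way to read the performance-model bullets is as an estimated completion time per server, with the task going to the minimum. The sketch below is an assumption-laden illustration, not NetSolve's actual model: the formula (transfer time plus compute time scaled by load), the `Server` fields, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    latency_s: float      # network delay to the server, in seconds
    bandwidth_bps: float  # network bandwidth, in bytes per second
    peak_flops: float     # server property: peak floating-point rate
    load: float           # dynamic state: number of competing processes


def estimated_time(server, data_bytes, task_flops):
    """Assumed model: time to ship the data plus time to compute under the current load."""
    transfer = server.latency_s + data_bytes / server.bandwidth_bps
    effective_flops = server.peak_flops / (1.0 + server.load)
    return transfer + task_flops / effective_flops


def pick_best(servers, data_bytes, task_flops):
    """Return the server with the smallest estimated completion time."""
    return min(servers, key=lambda s: estimated_time(s, data_bytes, task_flops))


servers = [
    Server("fast_but_busy", 0.01, 1e7, 1e9, 3.0),
    Server("slow_but_idle", 0.05, 1e6, 5e8, 0.0),
]
best = pick_best(servers, data_bytes=1e6, task_flops=1e9)
```

With these made-up numbers the idle server wins despite its slower network and processor, which is why the dynamic state has to be measured rather than assumed.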

NetSolve Fault Tolerance

- Inter-server fault tolerance
  - Fault tolerance among NetSolve servers
- Intra-server fault tolerance
  - Fault tolerance within a NetSolve server

NetSolve Fault Tolerance: Inter-server Fault Tolerance

- Performed by NetSolve agents
- Basic approach
  - Failure detection + task reallocation
  - Overload detection + task migration
- Introducing NetSolve storage servers
  - Store checkpoints or any information related to fault tolerance (must be platform-independent)
  - No reliance on failed or overloaded server for task migration
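The "failure detection + task reallocation" step could be sketched as follows. This is a toy model under stated assumptions, not the NetSolve agent's algorithm: failure is inferred from a missed heartbeat timeout, tasks are moved to the live server holding the fewest tasks, and all names and data structures are hypothetical.

```python
def detect_failed(last_heartbeat, now, timeout):
    """Return the servers whose last heartbeat is older than `timeout` seconds."""
    return {s for s, t in last_heartbeat.items() if now - t > timeout}


def reallocate(assignments, failed):
    """Move each task off a failed server onto the live server with the fewest tasks."""
    live = {s: list(tasks) for s, tasks in assignments.items() if s not in failed}
    if not live:
        raise RuntimeError("no live server available")
    for server in sorted(failed):
        for task in assignments.get(server, []):
            # Ties are broken by server name so the result is deterministic.
            target = min(live, key=lambda s: (len(live[s]), s))
            live[target].append(task)
    return live


last_heartbeat = {"s1": 100.0, "s2": 70.0, "s3": 98.0}
assignments = {"s1": ["t1"], "s2": ["t2", "t3"], "s3": []}
failed = detect_failed(last_heartbeat, now=101.0, timeout=10.0)
new_assignments = reallocate(assignments, failed)
```

Restarting a reallocated task from a checkpoint rather than from scratch is where the storage servers come in: the checkpoint has to be readable without contacting the failed machine.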

NetSolve Fault Tolerance: Intra-server Fault Tolerance

- Not a new problem
- Could be invisible to NetSolve
- Can take advantage of platform-specific features for fault tolerance
- Possible integration with inter-server fault tolerance

Diskless Checkpointing: Checksums and Reverse Computation

- Diskless checkpointing eliminates the need for stable storage
- N servers + a checkpointing server
  - At any point, consistent checkpoints taken at N servers (stored in memory)
  - A checksum of checkpoints stored at the checkpointing server
- Rollback using reverse computation
- State recovery using the checksum
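The checksum-based recovery can be illustrated with a small sketch. It assumes the checksum is a bytewise XOR parity over equal-length checkpoints, which is one common choice; the slides do not say which checksum is used. Reverse computation for rollback is not modeled, and the function names are hypothetical.

```python
from functools import reduce


def xor_bytes(x, y):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))


def checksum(checkpoints):
    """XOR all in-memory checkpoints into one parity block."""
    return reduce(xor_bytes, checkpoints)


def recover(surviving, parity):
    """Rebuild the single missing checkpoint from the survivors and the parity block."""
    return reduce(xor_bytes, surviving, parity)


# Three servers, each holding its own checkpoint in memory.
checkpoints = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = checksum(checkpoints)  # held by the checkpointing server
lost = 1                        # suppose server 1 fails
surviving = [c for i, c in enumerate(checkpoints) if i != lost]
restored = recover(surviving, parity)
```

A single parity block of this kind tolerates the loss of one server at a time; surviving more simultaneous failures would need a stronger code.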

Applications

- MCell with NetSolve
  - Large code, small data
- Matlab with NetSolve
  - Tradeoffs between parallelism and overhead
- IPARS with NetSolve
- ImageVision with NetSolve

Integration with ScaLAPACK

Integration with Condor

Integration with Ninf

Conclusion

- An interesting infrastructure for sharing computational resources
  - Both software and hardware
- Convenience, performance, and reliability
- Playground for fault tolerance
  - Both general and specific