Wide-Area Parallel Computing in Java Henri Bal Vrije Universiteit Amsterdam Faculty of Sciences...

Post on 18-Jan-2018


Wide-Area Parallel Computing in Java

Henri Bal

Vrije Universiteit Amsterdam

Faculty of Sciences

2

Introduction

• Distributed supercomputing
  - Parallel applications on a geographically distributed computing system (computational grid)
  - Examples: SETI@home, RSA-155

• Programming support
  - Language-neutral systems: Legion, Globus
  - Language-centric: Java

• Goal: study wide-area parallel computing in Java
  - Programming model: Remote Method Invocation

3

Outline

• Wide-area parallel computing

• Java Remote Method Invocation (RMI)

• Performance of JDK RMI

• The Manta high-performance Java system

• Wide-area parallel Java applications using RMI

• Application performance

4

Wide-area parallel computing

• Challenge
  - Tolerating poor latency and bandwidth of WANs

• Basic assumption: wide-area system is hierarchical
  - Connect clusters, not individual workstations
  - Most links are fast

• General approach
  - Optimize applications to exploit hierarchical structure -> most communication is local

5

Distributed ASCI Supercomputer

Clusters (number of nodes): VU (128), UvA (24), Leiden (24), Delft (24)

Wide-area links: 6 Mb/s ATM

Node configuration
  - 200 MHz Pentium Pro
  - 64-128 MB memory
  - 2.5 GB local disks
  - Myrinet LAN
  - Redhat Linux 2.0.36

6

Java

• Growing interest in Java for parallel applications
  - Java Grande forum

• Parallel programming support in Java
  - Shared memory: multithreading
  - Distributed memory: Remote Method Invocation

• Study suitability of Java RMI for (wide-area) parallel programming
  - Optimizing performance of local RMI [PPoPP’99]
  - Wide-area parallel programming using RMI [JavaGrande’99]

7

RMI (1)

• Flexible object-oriented RPC-like primitive
  - Easy interoperability between Java Virtual Machines
  - Polymorphism -> dynamic bytecode loading

void species(Animal x) throws … {
    System.out.println("Species " + x.name());
}

o.species(new Orca());     prints "Species orca"
o.species(new Panda());    prints "Species panda"
o.species(new Manta());    prints "Species manta"

Class hierarchy (figure): Animal, with subclasses Orca, Panda, and Manta
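The polymorphic call on this slide can be tried out locally. The sketch below is only an approximation: it does not set up a real RMI connection, but copies the argument through standard Java serialization (the mechanism RMI uses for non-remote parameters) and then calls a method declared against the base class. The class bodies, `copyByValue`, and `PolymorphicArgumentSketch` are assumptions made for illustration; in a real RMI the receiver might also have to load the subclass bytecode dynamically, which this example does not show.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical classes mirroring the hierarchy on the slide.
abstract class Animal implements Serializable {
    private static final long serialVersionUID = 1L;
    abstract String name();
}

final class Orca extends Animal {
    private static final long serialVersionUID = 1L;
    @Override String name() { return "orca"; }
}

final class Panda extends Animal {
    private static final long serialVersionUID = 1L;
    @Override String name() { return "panda"; }
}

final class Manta extends Animal {
    private static final long serialVersionUID = 1L;
    @Override String name() { return "manta"; }
}

final class PolymorphicArgumentSketch {
    // Stand-in for RMI marshalling: the argument is copied by value
    // through Java serialization and rebuilt on the "receiving" side.
    static Animal copyByValue(Animal x) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(x);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Animal) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("marshalling failed", e);
        }
    }

    // Receiver side: written against the base class only.
    static String species(Animal x) {
        return "Species " + x.name();
    }

    public static void main(String[] args) {
        Animal[] arguments = { new Orca(), new Panda(), new Manta() };
        for (Animal a : arguments) {
            System.out.println(species(copyByValue(a)));
        }
    }
}
```

The copy keeps its dynamic type, so `species` dispatches to the subclass implementation even though it only knows about `Animal`.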

8

RMI (2)

• Designed for client-server applications

• Automatic serialization (marshalling)

• Normally used in a high-latency environment
  - E.g., the Internet

• Is RMI fast enough for parallel programming?

9

JDK RMI Performance

Latency (microseconds)

            Fast Ethernet   Myrinet
JDK RMI     1711            1228
C RPC       228             30

(200 MHz Pentium Pro, JDK 1.1.4)

10

Why is JDK RMI slow?

• Serialization uses run-time type inspection

• Protocol overhead (class information)

• Thread creation for incoming calls

• TCP/IP

• Most code is written in Java

11

The Manta system

• Designed for high-performance computing

• Native (static) compilation
  - Source -> executable

• New fast RMI protocol between Manta nodes

• Support (polymorphic) RMIs with JVMs

• Implemented on wide-area DAS system

12

JDK versus Manta

                 JDK            time (µs)   Manta          time (µs)
Serialization    Runtime        670         Compiler       11
RMI protocol     Heavy-weight   950         Light-weight   10
Communication    TCP/IP         280         RPC/LFC        30

(200 MHz Pentium Pro, Myrinet, JDK 1.1.4 interpreter, 1 object as parameter)
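Adding up the three components in the table gives a rough per-call total for each system. The short Java sketch below does only that arithmetic; the class and method names are invented for illustration, and the totals are plain sums of the table entries, not separately measured figures.

```java
final class LatencyBreakdownSketch {
    // Sum of the three per-call cost components, in microseconds.
    static int totalMicros(int serialization, int rmiProtocol, int communication) {
        return serialization + rmiProtocol + communication;
    }

    public static void main(String[] args) {
        int jdk = totalMicros(670, 950, 280);   // JDK column of the table
        int manta = totalMicros(11, 10, 30);    // Manta column of the table
        System.out.println("JDK total:   " + jdk + " us");    // 1900 us
        System.out.println("Manta total: " + manta + " us");  // 51 us
        System.out.printf("JDK / Manta: %.1f%n", (double) jdk / manta);
    }
}
```

For this one-object call the components sum to 1900 µs for the JDK and 51 µs for Manta, a factor of roughly 37.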

13

Manta serialization

Java source:

class Test implements Serializable {
    int i;
    double d;
    Object o;
}

JDK (java.io.ObjectOutputStream, excerpt):

package java.io;

import java.util.Stack;

public class ObjectOutputStream extends OutputStream
        implements ObjectOutput, ObjectStreamConstants
{
    public ObjectOutputStream(OutputStream out) throws IOException {
        this.out = out;
        dos = new DataOutputStream(this);
        buf = new byte[1024];
        writeStreamHeader();
        resetStream();
    }

    public final void writeObject(Object obj) throws IOException {
        Object prevObject = currentObject;
        ObjectStreamClass prevClassDesc = currentClassDesc;
        boolean oldBlockDataMode = setBlockData(false);
        recursionDepth++;

        try {
            if (serializeNullAndRepeat(obj))
                return;

            if (checkSpecialClasses(obj))
                return;

            if (enableReplace) {
                Object altobj = replaceObject(obj);
                if (obj != altobj) {
                    if (!(altobj instanceof Serializable)) {
                        String clname = altobj.getClass().getName();
                        throw new NotSerializableException(clname);
                    }

                    if (serializeNullAndRepeat(altobj)) {
                        addReplacement(obj, altobj);
                        return;
                    }

                    addReplacement(obj, altobj);

                    if (checkSpecialClasses(altobj))
                        return;

                    obj = altobj;
                }
            }

            outputObject(obj);
        } catch (ObjectStreamException ee) {
            if (abortIOException == null) {
                try {
                    setBlockData(false);
                    writeCode(TC_EXCEPTION);
                    resetStream();
                    this.writeObject(ee);
                    resetStream();
                    abortIOException = ee;
                } catch (IOException fatal) {
                    abortIOException =
                        new StreamCorruptedException(fatal.getMessage());
                }
            }
        } catch (IOException ee) {
            if (abortIOException == null)
                abortIOException = ee;
        } finally {
            recursionDepth--;
            currentObject = prevObject;
            currentClassDesc = prevClassDesc;
            setBlockData(oldBlockDataMode);
        }

        IOException pending = abortIOException;
        if (recursionDepth == 0)
            abortIOException = null;
        if (pending != null) {
            throw pending;
        }
    }

    private boolean checkSpecialClasses(Object obj) throws IOException {
        if (obj instanceof Class) {
            outputClass((Class) obj);
            return true;
        }

        if (obj instanceof ObjectStreamClass) {
            outputClassDescriptor((ObjectStreamClass) obj);
            return true;
        }

        if (obj instanceof String) {
            outputString((String) obj);
            return true;
        }

        if (obj.getClass().isArray()) {
            outputArray(obj);
            return true;
        }
        return false;
    }

    public final void defaultWriteObject() throws IOException {
        if (currentObject == null || currentClassDesc == null)
            throw new NotActiveException("defaultWriteObject");

        if (currentClassDesc.getFieldSequence() != null) {
            boolean prevmode = setBlockData(false);
            outputClassFields(currentObject, currentClassDesc.forClass(),
                              currentClassDesc.getFieldSequence());
            setBlockData(prevmode);
        }
    }

    public void reset() throws IOException {
        if (currentObject != null || currentClassDesc != null)
            throw new IOException("Illegal call to reset");

        setBlockData(false);
        writeCode(TC_RESET);

        resetStream();
        abortIOException = null;
    }

    private void resetStream() throws IOException {
        wireHandle2Object = new Object[100];
        wireNextHandle = new int[100];
        wireHash2Handle = new int[101];
        for (int i = 0; i < wireHash2Handle.length; i++) {
            wireHash2Handle[i] = -1;
        }
        classDescStack = new Stack();
        nextWireOffset = 0;
        replaceObjects = null;
        nextReplaceOffset = 0;
        setBlockData(true);
    }

    protected void annotateClass(Class cl) throws IOException {
    }

    protected Object replaceObject(Object obj) throws IOException {
        return obj;
    }

    protected final boolean enableReplaceObject(boolean enable)
            throws SecurityException {
        boolean previous = enableReplace;
        if (enable) {
            ClassLoader loader = this.getClass().getClassLoader();
            if (loader == null) {
                enableReplace = true;
                return previous;
            }
            throw new SecurityException("Not trusted class");
        } else {
            enableReplace = false;
        }
        return previous;
    }

    protected void writeStreamHeader() throws IOException {
        writeShort(STREAM_MAGIC);
        writeShort(STREAM_VERSION);
    }

    private void outputString(String s) throws IOException {
        assignWireOffset(s);
        writeCode(TC_STRING);
        writeUTF(s);
    }

    private void outputClass(Class aclass) throws IOException {
        writeCode(TC_CLASS);

        ObjectStreamClass v = ObjectStreamClass.lookup(aclass);

        if (v == null)
            throw new NotSerializableException(aclass.getName());

        outputClassDescriptor(v);

        assignWireOffset(aclass);
    }

    private void outputClassDescriptor(ObjectStreamClass classdesc)
            throws IOException {
        if (serializeNullAndRepeat(classdesc))
            return;

        writeCode(TC_CLASSDESC);
        String classname = classdesc.getName();

        writeUTF(classname);
        writeLong(classdesc.getSerialVersionUID());

        assignWireOffset(classdesc);

        classdesc.write(this);

        boolean prevMode = setBlockData(true);
        annotateClass(classdesc.forClass());
        setBlockData(prevMode);
        writeCode(TC_ENDBLOCKDATA);

        ObjectStreamClass superdesc = classdesc.getSuperclass();
        outputClassDescriptor(superdesc);
    }

    private void outputArray(Object obj) throws IOException {
        Class currclass = obj.getClass();

        ObjectStreamClass v = ObjectStreamClass.lookup(currclass);

        writeCode(TC_ARRAY);
        outputClassDescriptor(v);

        assignWireOffset(obj);

        int i, length;
        Class type = currclass.getComponentType();

        if (type.isPrimitive()) {
            if (type == Integer.TYPE) {
                int[] array = (int[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeInt(array[i]);
                }
            } else if (type == Byte.TYPE) {
                byte[] array = (byte[]) obj;
                length = array.length;
                writeInt(length);
                write(array, 0, length);
            } else if (type == Long.TYPE) {
                long[] array = (long[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeLong(array[i]);
                }
            } else if (type == Float.TYPE) {
                float[] array = (float[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeFloat(array[i]);
                }
            } else if (type == Double.TYPE) {
                double[] array = (double[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeDouble(array[i]);
                }
            } else if (type == Short.TYPE) {
                short[] array = (short[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeShort(array[i]);
                }
            } else if (type == Character.TYPE) {
                char[] array = (char[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeChar(array[i]);
                }
            } else if (type == Boolean.TYPE) {
                boolean[] array = (boolean[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeBoolean(array[i]);
                }
            } else {
                throw new InvalidClassException(currclass.getName());
            }
        } else {
            Object[] array = (Object[]) obj;
            length = array.length;
            writeInt(length);
            for (i = 0; i < length; i++) {
                writeObject(array[i]);
            }
        }
    }

    private void outputObject(Object obj) throws IOException {
        currentObject = obj;
        Class currclass = obj.getClass();

        currentClassDesc = ObjectStreamClass.lookup(currclass);
        if (currentClassDesc == null) {
            throw new NotSerializableException(currclass.getName());
        }

        writeCode(TC_OBJECT);
        outputClassDescriptor(currentClassDesc);

        assignWireOffset(obj);

        if (currentClassDesc.isExternalizable()) {
            Externalizable ext = (Externalizable) obj;

            ext.writeExternal(this);
        } else {
            int stackMark = classDescStack.size();
            try {
                ObjectStreamClass next;
                while ((next = currentClassDesc.getSuperclass()) != null) {
                    classDescStack.push(currentClassDesc);
                    currentClassDesc = next;
                }

                do {
                    if (currentClassDesc.hasWriteObject()) {
                        setBlockData(true);
                        invokeObjectWriter(obj, currentClassDesc.forClass());
                        setBlockData(false);
                        writeCode(TC_ENDBLOCKDATA);
                    } else {
                        defaultWriteObject();
                    }
                } while (classDescStack.size() > stackMark &&
                         (currentClassDesc =
                             (ObjectStreamClass) classDescStack.pop()) != null);
            } finally {
                classDescStack.setSize(stackMark);
            }
        }
    }

    private boolean serializeNullAndRepeat(Object obj) throws IOException {
        if (obj == null) {
            writeCode(TC_NULL);
            return true;
        }

        if (replaceObjects != null) {
            for (int i = 0; i < nextReplaceOffset; i += 2) {
                if (replaceObjects[i] == obj) {
                    obj = replaceObjects[i + 1];
                    break;
                }
            }
        }

        int handle = findWireOffset(obj);
        if (handle >= 0) {
            writeCode(TC_REFERENCE);
            writeInt(handle + baseWireHandle);
            return true;
        }
        return false;
    }

    private int findWireOffset(Object obj) {
        int hash = System.identityHashCode(obj);
        int index = (hash & 0x7FFFFFFF) % wireHash2Handle.length;

        for (int handle = wireHash2Handle[index];
             handle >= 0;
             handle = wireNextHandle[handle])
        {
            if (wireHandle2Object[handle] == obj)
                return handle;
        }
        return -1;
    }

    private void assignWireOffset(Object obj) throws IOException {
        if (nextWireOffset == wireHandle2Object.length) {
            Object[] oldhandles = wireHandle2Object;
            wireHandle2Object = new Object[nextWireOffset * 2];
            System.arraycopy(oldhandles, 0,
                             wireHandle2Object, 0, nextWireOffset);

            int[] oldnexthandles = wireNextHandle;
            wireNextHandle = new int[nextWireOffset * 2];
            System.arraycopy(oldnexthandles, 0,
                             wireNextHandle, 0, nextWireOffset);
        }
        wireHandle2Object[nextWireOffset] = obj;

        hashInsert(obj, nextWireOffset);

        nextWireOffset++;
        return;
    }

    private void hashInsert(Object obj, int offset) {
        int hash = System.identityHashCode(obj);
        int index = (hash & 0x7FFFFFFF) % wireHash2Handle.length;
        wireNextHandle[offset] = wireHash2Handle[index];
        wireHash2Handle[index] = offset;
    }

    private void addReplacement(Object orig, Object replacement) {
        if (replaceObjects == null) {
            replaceObjects = new Object[10];
        }
        if (nextReplaceOffset == replaceObjects.length) {
            Object[] oldhandles = replaceObjects;
            replaceObjects = new Object[2 + nextReplaceOffset * 2];
            System.arraycopy(oldhandles, 0,
                             replaceObjects, 0, nextReplaceOffset);
        }
        replaceObjects[nextReplaceOffset++] = orig;
        replaceObjects[nextReplaceOffset++] = replacement;
    }

    private void writeCode(int tag) throws IOException {
        writeByte(tag);
    }

    private boolean blockDataMode;
    private byte[] buf;
    private int count;
    private OutputStream out;

    public void write(int data) throws IOException {
        if (count >= buf.length)
            drain();
        buf[count++] = (byte) data;
    }

    public void write(byte b[]) throws IOException {
        write(b, 0, b.length);
    }

    public void write(byte b[], int off, int len) throws IOException {
        if (len < 0)
            throw new IndexOutOfBoundsException();

        int avail = buf.length - count;
        if (len <= avail) {
            System.arraycopy(b, off, buf, count, len);
            count += len;
        } else {
            drain();
            if (blockDataMode) {
                if (len <= 255) {
                    out.write(TC_BLOCKDATA);
                    out.write(len);
                } else {
                    out.write(TC_BLOCKDATALONG);

                    out.write((len >> 24) & 0xFF);
                    out.write((len >> 16) & 0xFF);
                    out.write((len >> 8) & 0xFF);
                    out.write(len & 0xFF);
                }
            }
            out.write(b, off, len);
        }
    }

Manta (compiler-generated serializer):

void PackageClass__Test(…) {
    WRITE_INT( type_id );
    WRITE_INT( i );
    WRITE_DOUBLE( d );
    WRITE_OBJECT( o );
}
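The contrast on this slide, run-time type inspection versus a serializer generated per class, can be approximated in plain Java. The sketch below is an analogy only: Manta emits native code, whereas here both variants are ordinary Java methods. `Sample`, `writeReflectively`, and `writeSpecialized` are hypothetical names, and the `Object o` field of the slide's `Test` class is left out to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical stand-in for the slide's Test class (without the Object field).
final class Sample {
    int i = 42;
    double d = 1.5;
}

final class SerializerSketch {
    // Run-time approach: the field layout is rediscovered by reflection on
    // every call. Note that getDeclaredFields() does not promise any order.
    static byte[] writeReflectively(Object obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        try {
            for (Field f : obj.getClass().getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers())) {
                    continue;
                }
                f.setAccessible(true);
                Class<?> type = f.getType();
                if (type == int.class) {
                    out.writeInt(f.getInt(obj));
                } else if (type == double.class) {
                    out.writeDouble(f.getDouble(obj));
                } else {
                    throw new IllegalArgumentException("unsupported type: " + type);
                }
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // Compile-time approach: straight-line code that knows the class layout,
    // comparable in spirit to the generated serializer shown on the slide.
    static byte[] writeSpecialized(Sample s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        try {
            out.writeInt(s.i);
            out.writeDouble(s.d);
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Sample s = new Sample();
        System.out.println("reflective bytes:  " + writeReflectively(s).length);
        System.out.println("specialized bytes: " + writeSpecialized(s).length);
    }
}
```

Both methods write the same amount of data; the difference is that the specialized version does no type inspection at run time.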

14

RMI protocol

• Light-weight RMI protocol
  - Send minimal type information

• Avoid thread creation
  - Simple nonblocking methods executed directly

• Avoid interrupts
  - Poll network when processor is idle

• Everything is written in C
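The "avoid thread creation" point can be illustrated with a small dispatcher. This is a Java sketch under stated assumptions, not Manta's implementation (which the slide says is written in C): the `IncomingCall` type and its `mayBlock` flag are hypothetical, and the flag is set by hand here.

```java
final class DispatchSketch {
    // Hypothetical description of an incoming remote call. How a real
    // system decides that a method cannot block is not modelled here.
    static final class IncomingCall {
        final boolean mayBlock;
        final Runnable body;

        IncomingCall(boolean mayBlock, Runnable body) {
            this.mayBlock = mayBlock;
            this.body = body;
        }
    }

    // Runs simple nonblocking calls directly on the receiving thread and
    // starts a thread only for calls that might block.
    // Returns the started thread, or null if the call ran directly.
    static Thread dispatch(IncomingCall call) {
        if (!call.mayBlock) {
            call.body.run();
            return null;
        }
        Thread worker = new Thread(call.body);
        worker.start();
        return worker;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread direct = dispatch(new IncomingCall(false,
                () -> System.out.println("simple call ran directly")));
        Thread spawned = dispatch(new IncomingCall(true,
                () -> System.out.println("blocking call ran in its own thread")));
        spawned.join();
        System.out.println("thread created for simple call: " + (direct != null));
    }
}
```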

15

Communication software

• Panda user space RPC protocol

• LFC Myrinet control program
  - Similar to active messages
  - Implemented partly on Myrinet network interfaces
  - Myrinet network interfaces mapped in user space

Protocol layers (figure): Manta RMI; Panda RPC; LFC, UDP, TCP; Myrinet, Ethernet, ATM

16

Interoperability with JVMs

• Manta RMI protocol incompatible with JDK
  - Use fast RMI between Manta nodes
  - Use JDK-compliant protocol with JVMs

• Polymorphic RMI requires exchanging bytecodes
  - Also generate bytecodes when compiling a program
  - Dynamically compile and link bytecodes into running program

17

Null-RMI latency

Latency (microseconds)

          Fast Ethernet   Myrinet
JDK       1711            1228
Manta     233             39.9
C RPC     228             30

18

RMI Throughput

Throughput (Mbyte/second)

          Fast Ethernet   Myrinet
JDK       0.97            4.66
Manta     7.3             38.6
C RPC     10.3            55.7

19

Outline

• Wide-area parallel computing

• Java Remote Method Invocation (RMI)

• Performance of JDK RMI

• The Manta high-performance Java system

• Wide-area parallel Java applications using RMI

• Application performance

20

Manta on wide-area DAS

              Protocol   Null-latency (µsec)   Bandwidth (MByte/sec)
Myrinet LAN   LFC        39.9                  38.6
ATM WAN       TCP/IP     5600                  0.55

• 2 orders of magnitude between intra-cluster (LAN) and inter-cluster (WAN) communication performance

• Manta exposes hierarchical structure to application
  - Applications are optimized to reduce WAN overhead
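To get a feel for the gap between the two kinds of link, the figures from this slide can be fed into a simple linear cost model, time = latency + size / bandwidth. That model, the 1000-byte message size, and the names in the sketch are assumptions made for illustration; the slide gives only the latency and bandwidth values.

```java
final class LinkModelSketch {
    // Estimated one-way time in microseconds for a message of the given size.
    // Assumes 1 MByte = 10^6 bytes, so a bandwidth in MByte/s is the same
    // number in bytes per microsecond.
    static double transferMicros(double latencyMicros,
                                 double bandwidthMBytePerSec,
                                 long bytes) {
        return latencyMicros + bytes / bandwidthMBytePerSec;
    }

    public static void main(String[] args) {
        long messageBytes = 1000; // hypothetical message size
        double lan = transferMicros(39.9, 38.6, messageBytes);
        double wan = transferMicros(5600, 0.55, messageBytes);
        System.out.printf("LAN (Myrinet, LFC): %.1f us%n", lan);
        System.out.printf("WAN (ATM, TCP/IP):  %.1f us%n", wan);
        System.out.printf("latency ratio:   %.0f%n", 5600 / 39.9);
        System.out.printf("bandwidth ratio: %.0f%n", 38.6 / 0.55);
    }
}
```

Under this model the latency differs by a factor of about 140 and the bandwidth by a factor of about 70, consistent with the "2 orders of magnitude" on the slide.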

21

Wide-area programming

• Problem: how to tolerate difference between LAN and WAN performance

• Wide-area system is structured hierarchically
  - Most links are fast

• Approach: application-level optimizations that exploit the hierarchical structure
  - Reduce wide-area communication

22

Application experience

• Parallel applications
  - Successive overrelaxation (SOR)
  - All-pairs shortest paths problem (ASP)
  - Traveling salesperson problem (TSP)
  - Iterative Deepening A* (IDA*)

• Measurements on wide-area DAS
  - 1-4 clusters with 16 nodes
  - Comparison with single 64-node cluster

23

Successive Overrelaxation

• Red/black SOR
  - Neighbor communication, using RMI

• Problem: nodes at cluster boundaries
  - Overlap wide-area communication with computation
  - RMI is synchronous -> use multithreading

Figure: Cluster 1 (CPU 1, CPU 2, CPU 3) and Cluster 2 (CPU 4, CPU 5, CPU 6); latency 40 µs within a cluster, 5600 µsec between clusters
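The overlap technique for SOR can be sketched as follows: because a remote call is synchronous, the wide-area exchange is issued from a helper thread while the calling thread works on rows that do not depend on the remote data. `BoundaryNeighbor` is a hypothetical stand-in for a remote interface (a real RMI interface would extend `java.rmi.Remote`), and the method names are invented for this example.

```java
import java.util.concurrent.atomic.AtomicReference;

final class OverlapSketch {
    // Hypothetical stand-in for the remote neighbor in another cluster.
    interface BoundaryNeighbor {
        double[] exchangeBoundary(double[] myBoundaryRow);
    }

    // Starts the (slow, synchronous) exchange in a helper thread, performs
    // the interior computation meanwhile, then waits for the boundary row.
    static double[] exchangeWhileComputing(final BoundaryNeighbor remote,
                                           final double[] myBoundaryRow,
                                           Runnable interiorWork) {
        final AtomicReference<double[]> received = new AtomicReference<>();
        Thread comm = new Thread(
                () -> received.set(remote.exchangeBoundary(myBoundaryRow)));
        comm.start();
        interiorWork.run(); // overlaps with the wide-area call
        try {
            comm.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during exchange", e);
        }
        return received.get();
    }

    public static void main(String[] args) {
        // Simulated wide-area neighbor: sleeps briefly, then echoes the row.
        BoundaryNeighbor slowNeighbor = row -> {
            try {
                Thread.sleep(6);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return row.clone();
        };
        double[] got = exchangeWhileComputing(slowNeighbor,
                new double[] { 1.0, 2.0 },
                () -> System.out.println("computing interior rows"));
        System.out.println("received boundary row of length " + got.length);
    }
}
```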

24

All-pairs shortest paths

• Broadcast at beginning of each iteration

• Problem: broadcasting over wide-area links
  - Lack of broadcast in Java -> use spanning tree
  - Use coordinator node per cluster
  - Do asynchronous send to all remote coordinators
  - Implemented using threads

Figure: broadcast across clusters 1, 2, and 3
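A minimal in-memory sketch of the coordinator scheme is given below. It simulates nodes with plain objects instead of RMI stubs; `Node`, `makeClusters`, and the convention that the first node of each cluster acts as coordinator are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ClusterBroadcastSketch {
    // Hypothetical in-memory stand-in for a compute node.
    static final class Node {
        final List<int[]> inbox = Collections.synchronizedList(new ArrayList<int[]>());

        void deliver(int[] row) {
            inbox.add(row);
        }
    }

    static List<List<Node>> makeClusters(int clusterCount, int nodesPerCluster) {
        List<List<Node>> clusters = new ArrayList<>();
        for (int c = 0; c < clusterCount; c++) {
            List<Node> members = new ArrayList<>();
            for (int n = 0; n < nodesPerCluster; n++) {
                members.add(new Node());
            }
            clusters.add(members);
        }
        return clusters;
    }

    // Sends the row once per remote cluster (to its coordinator, simulated by
    // one thread per cluster), and lets each cluster forward it locally.
    // Returns the number of wide-area messages.
    static int broadcast(List<List<Node>> clusters, int senderCluster, final int[] row) {
        List<Thread> wideAreaSends = new ArrayList<>();
        for (int c = 0; c < clusters.size(); c++) {
            if (c == senderCluster) {
                continue;
            }
            final List<Node> members = clusters.get(c);
            Thread send = new Thread(() -> forwardLocally(members, row));
            send.start(); // asynchronous: the sender does not wait here
            wideAreaSends.add(send);
        }
        forwardLocally(clusters.get(senderCluster), row);
        for (Thread t : wideAreaSends) {
            try {
                t.join(); // only so that the demo finishes deterministically
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return wideAreaSends.size();
    }

    // Work done by a coordinator on its fast local network.
    static void forwardLocally(List<Node> members, int[] row) {
        for (Node n : members) {
            n.deliver(row);
        }
    }

    public static void main(String[] args) {
        List<List<Node>> clusters = makeClusters(3, 2);
        int wanMessages = broadcast(clusters, 0, new int[] { 1, 2, 3 });
        System.out.println("wide-area messages: " + wanMessages);
    }
}
```

With three clusters, the row crosses the wide-area network twice, regardless of how many nodes each cluster contains.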

25

Traveling salesperson problem

• Replicated-worker style parallel search algorithm

• Problem: work distribution
  - Central job-queue has high overhead
  - Statically distribute jobs over clusters
  - Use centralized job-queue per cluster
  - Easy to express using RMI

Figure: clusters 1, 2, and 3
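The two-level work distribution can be sketched with one queue per cluster. The round-robin split and all names below are assumptions; the slide only says that jobs are distributed statically over clusters and that each cluster has its own centralized queue. In a real system each queue would be a remote object living on one node of its cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class ClusterJobQueuesSketch {
    // Static distribution: job number k goes to cluster k mod clusterCount.
    static List<Queue<Integer>> distribute(List<Integer> jobs, int clusterCount) {
        List<Queue<Integer>> queues = new ArrayList<>();
        for (int c = 0; c < clusterCount; c++) {
            queues.add(new ConcurrentLinkedQueue<Integer>());
        }
        for (int k = 0; k < jobs.size(); k++) {
            queues.get(k % clusterCount).add(jobs.get(k));
        }
        return queues;
    }

    // A worker only ever asks the queue of its own cluster, so fetching a
    // job never crosses a wide-area link. Returns null when the queue is empty.
    static Integer nextJob(List<Queue<Integer>> queues, int myCluster) {
        return queues.get(myCluster).poll();
    }

    public static void main(String[] args) {
        List<Integer> jobs = new ArrayList<>();
        for (int j = 0; j < 10; j++) {
            jobs.add(j);
        }
        List<Queue<Integer>> queues = distribute(jobs, 4);
        for (int c = 0; c < queues.size(); c++) {
            System.out.println("cluster " + c + " holds " + queues.get(c));
        }
    }
}
```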

26

Iterative Deepening A*

• Parallel search algorithm using work stealing

• Problem: inter-cluster work stealing

• Optimization: first look for work in local cluster
  - Easy to express using RMI

Figure: clusters 1 and 2
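The "local cluster first" optimization comes down to the order in which a thief tries its victims. The sketch below assumes nodes are numbered consecutively and that node n belongs to cluster n / nodesPerCluster; these conventions and all names are hypothetical. Victims are tried in a fixed order here for simplicity, whereas a work-stealing scheduler would typically pick them at random within each group.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;

final class LocalFirstStealingSketch {
    // Victims in the thief's own cluster come first, remote nodes last.
    static List<Integer> victimOrder(int thief, int nodesPerCluster, int totalNodes) {
        List<Integer> local = new ArrayList<>();
        List<Integer> remote = new ArrayList<>();
        int myCluster = thief / nodesPerCluster;
        for (int n = 0; n < totalNodes; n++) {
            if (n == thief) {
                continue;
            }
            if (n / nodesPerCluster == myCluster) {
                local.add(n);
            } else {
                remote.add(n);
            }
        }
        local.addAll(remote);
        return local;
    }

    // Tries each victim in order and takes a job from the tail of its queue.
    // Returns null if no node has work left.
    static Integer steal(int thief, int nodesPerCluster, List<Deque<Integer>> queues) {
        for (int victim : victimOrder(thief, nodesPerCluster, queues.size())) {
            Integer job = queues.get(victim).pollLast();
            if (job != null) {
                return job;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Deque<Integer>> queues = new ArrayList<>();
        for (int n = 0; n < 4; n++) {
            queues.add(new ConcurrentLinkedDeque<Integer>());
        }
        queues.get(1).add(5); // work on a node in the thief's cluster
        queues.get(2).add(7); // work on a node in the other cluster
        System.out.println("first steal:  " + steal(0, 2, queues));
        System.out.println("second steal: " + steal(0, 2, queues));
    }
}
```

In the demo, node 0 first takes the job held by its cluster neighbor (node 1) and goes to the other cluster only once local work has run out.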

27

Performance

• Wide-area DAS system: 4 clusters of 16 CPUs

• Comparison with single 16-node and 64-node cluster

Figure: speedup (axis 0 to 70) of SOR, ASP, TSP, and IDA* on four configurations: 1 x 16 CPUs, 4 x 16 CPUs, 4 x 16 CPUs (optimized), and 1 x 64 CPUs

28

Conclusions

• Fast RMI possible through
  - Compiler-generated serialization, light-weight communication & RMI protocols

• Optimized wide-area applications are efficient
  - Reduce wide-area communication, or hide its latency

• Java RMI is easy to use, but some optimizations are awkward to express
  - No asynchronous communication, collective comm.

• Programming systems should take hierarchical structure of wide-area systems into account

http://www.cs.vu.nl/manta

29

Performance breakdown Manta

Figure (Fast Ethernet): stacked bars (axis 0 to 300) for an RMI with an empty parameter list and with 1, 2, and 3 objects as parameters, each bar split into Communication, RMI Overhead, and Serialization. Legible data labels: 227, 232, 235, 243.