Experience with a Cluster JVM

Philip J. Hatcher
University of New Hampshire
hatcher@unh.edu



Acknowledgements

• UNH students
  – Mark MacBeth and Keith McGuigan

• PM2 team
  – very effective and enjoyable collaboration


Traditional Parallel Programming

• Parallel programming supported by using a serial language plus a “bag on the side”.
  – e.g. Fortran plus MPI

• Parallel programming supported by extending a serial language.
  – e.g. High Performance Fortran


My History

• I spent years studying data-parallel extensions to C, such as C*.
  – Users never really accepted extensions.
  – They found them too complex.
  – They wanted standard, well-integrated solutions.


Java is a good thing!

• Java is explicitly parallel!
  – Language includes a threaded programming model.

• Java employs a relaxed memory model.
  – Consistency model aids an implementation on distributed-memory parallel computers.


Java Threads

• Threads are objects.

• The class java.lang.Thread contains all of the methods for initializing, running, suspending, querying and destroying threads.


java.lang.Thread methods

• Thread() - constructor for thread object.

• start() - start the thread executing.

• run() - method invoked by ‘start’.

• stop(), suspend(), resume(), join(), yield().

• setPriority().
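A minimal sketch of how these methods fit together, assuming a simple array-summing task: each worker is a Thread subclass whose run() is invoked by start(), and the parent collects results after join(). The names SliceSummer and parallelSum are hypothetical and do not come from the talk. Note that stop(), suspend() and resume() have long been deprecated, so the sketch avoids them.

```java
// Illustrative only: SliceSummer and parallelSum are hypothetical names.
class SliceSummer extends Thread {
    private final int[] data;
    private final int lo, hi;   // this worker sums data[lo..hi)
    private long sum;           // written in run(), read by the parent after join()

    SliceSummer(int[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    public void run() {         // invoked on the new thread by start()
        long s = 0;
        for (int i = lo; i < hi; i++) {
            s += data[i];
        }
        sum = s;
    }

    static long parallelSum(int[] data, int nThreads) {
        SliceSummer[] workers = new SliceSummer[nThreads];
        int chunk = (data.length + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            int lo = Math.min(t * chunk, data.length);
            int hi = Math.min(lo + chunk, data.length);
            workers[t] = new SliceSummer(data, lo, hi);
            workers[t].start();
        }
        long total = 0;
        for (SliceSummer w : workers) {
            try {
                w.join();       // wait for the worker; its writes are visible afterwards
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
            total += w.sum;
        }
        return total;
    }
}
```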


Java Synchronization

• Java uses monitors, which protect a region of code by allowing only one thread at a time to execute it.

• Monitors utilize locks.

• There is a lock associated with each object.


synchronized keyword

• synchronized ( Exp ) Block

• public class Q {
      synchronized void put(…) { … }
  }
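The two forms above can be sketched with a shared counter, assuming several threads increment it concurrently. SyncCounter is a hypothetical class for illustration. The synchronized method and the synchronized block both acquire the lock of the same object, so only one thread at a time touches the field.

```java
// Illustrative only: SyncCounter is a hypothetical class.
class SyncCounter {
    private int count;

    // A synchronized method holds the lock of `this` for its whole body.
    synchronized void increment() {
        count++;
    }

    // The block form names the lock object explicitly; here it is the same lock.
    int get() {
        synchronized (this) {
            return count;
        }
    }

    // Runs nThreads threads that each call increment() perThread times.
    static int run(int nThreads, int perThread) {
        SyncCounter counter = new SyncCounter();
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return counter.get();
    }
}
```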


java.lang.Object methods

• wait() - the calling thread, which must hold the lock for the object, is placed in a wait set associated with the object. The lock is then released.

• notify() - an arbitrary thread in the wait set of this object is awakened and then competes again for the object's lock.

• notifyAll() - all waiting threads are awakened.
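A minimal sketch of these methods in use, assuming a one-element buffer shared by a producer and a consumer. OneSlot and transfer are hypothetical names. Each wait() sits in a loop because an awakened thread must compete for the lock again, and the condition may have changed by the time it gets it.

```java
// Illustrative only: OneSlot is a hypothetical one-element buffer.
class OneSlot {
    private int value;
    private boolean full;

    synchronized void put(int v) throws InterruptedException {
        while (full) {
            wait();             // releases this object's lock while waiting
        }
        value = v;
        full = true;
        notifyAll();            // wake any thread waiting in get()
    }

    synchronized int get() throws InterruptedException {
        while (!full) {
            wait();
        }
        full = false;
        notifyAll();            // wake any thread waiting in put()
        return value;
    }

    // A producer thread puts 1..n; the caller consumes the values and sums them.
    static long transfer(int n) {
        OneSlot slot = new OneSlot();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) {
                    slot.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        long total = 0;
        try {
            for (int i = 0; i < n; i++) {
                total += slot.get();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during transfer", e);
        }
        return total;
    }
}
```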


Shared-Memory Model

• Java threads execute in a virtual shared memory.

• All threads are able to access all objects.

• But threads may not access each other’s stacks.


Java Memory Consistency

• A variant of release consistency.

• Threads can keep locally cached copies of objects.

• Consistency is provided by requiring that:
  – a thread's object cache be flushed upon entry to a monitor.
  – local modifications made to cached objects be transmitted to the central memory when a thread exits a monitor.
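The two rules can be illustrated with a toy, single-threaded model in which each "thread" has an explicit cache in front of a central memory. This is only a sketch of the idea, not how any JVM is implemented; ToyMemory, ThreadView and their methods are hypothetical, and keeping unsent modifications across a monitor entry is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the rules above. All names are hypothetical.
class ToyMemory {
    private final Map<String, Integer> central = new HashMap<>();

    int centralValue(String field) {
        return central.getOrDefault(field, 0);
    }

    ThreadView newView() {
        return new ThreadView();
    }

    // One thread's locally cached copies of fields.
    class ThreadView {
        private final Map<String, Integer> cache = new HashMap<>();
        private final Map<String, Integer> modified = new HashMap<>();

        int read(String field) {
            // Miss: fetch from central memory. Hit: use the (possibly stale) local copy.
            return cache.computeIfAbsent(field, f -> central.getOrDefault(f, 0));
        }

        void write(String field, int v) {
            cache.put(field, v);
            modified.put(field, v);
        }

        // Monitor entry: discard cached copies so later reads refetch.
        // Assumption of this sketch: modifications not yet sent are kept.
        void monitorEnter() {
            cache.keySet().retainAll(modified.keySet());
        }

        // Monitor exit: transmit local modifications to central memory.
        void monitorExit() {
            central.putAll(modified);
            modified.clear();
        }
    }
}
```

In this model a view that cached a field before another view's monitor exit keeps seeing the old value until it enters a monitor itself, which is the behavior the relaxed model permits.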


Problems with Java Threads

• Java support for threads is very low level.

• Java memory model is not very well understood.


Threads API

• No condition variables.

• No semaphores.

• No barriers.

• No collective operations on thread groups (e.g. sum reduction).

• No parallel collections.
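To give a sense of what programmers had to build by hand, here is a sketch of a reusable barrier written only with monitors, wait() and notifyAll(), followed by a small sum reduction that uses it. MonitorBarrier and allSeeTotal are hypothetical names; this is one possible construction, not code from the talk.

```java
// Illustrative only: MonitorBarrier is a hypothetical hand-built barrier.
class MonitorBarrier {
    private final int parties;
    private int arrived;
    private int generation;     // incremented each time the barrier opens

    MonitorBarrier(int parties) {
        this.parties = parties;
    }

    synchronized void await() throws InterruptedException {
        int myGeneration = generation;
        if (++arrived == parties) {
            arrived = 0;        // last arrival resets the barrier and releases everyone
            generation++;
            notifyAll();
            return;
        }
        while (myGeneration == generation) {
            wait();
        }
    }

    // Sum reduction: thread i contributes i + 1; after the barrier every thread reads the total.
    static boolean allSeeTotal(int nThreads) {
        MonitorBarrier barrier = new MonitorBarrier(nThreads);
        Object lock = new Object();
        int[] total = new int[1];           // guarded by lock
        int[] seen = new int[nThreads];     // seen[i] is written only by thread i
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                try {
                    synchronized (lock) { total[0] += id + 1; }
                    barrier.await();
                    synchronized (lock) { seen[id] = total[0]; }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[t].start();
        }
        int expected = nThreads * (nThreads + 1) / 2;
        boolean ok = true;
        for (int t = 0; t < nThreads; t++) {
            try {
                workers[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            ok &= (seen[t] == expected);
        }
        return ok;
    }
}
```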


So…

• Using low-level operations can be difficult and error-prone.

• Everyone is “re-inventing the wheel” as they struggle to construct higher level abstractions.


Java Specification Request 166

• Expert Group formed 01/23/02.

• Goal is to provide java.util.concurrent:
  – atomic variables
  – special-purpose locks, barriers, semaphores and condition variables
  – queues and related collections for multithreaded use
  – thread pools
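For comparison, a sketch of a sum reduction using the package as it eventually shipped in the standard library: a fixed thread pool plus an atomic variable. PoolSum and sumTo are hypothetical names, and the strided work split is just one convenient choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: PoolSum is a hypothetical class.
class PoolSum {
    // Sums 1..n using nTasks tasks on a thread pool.
    static long sumTo(int n, int nTasks) {
        ExecutorService pool = Executors.newFixedThreadPool(nTasks);
        AtomicLong total = new AtomicLong();
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int t = 0; t < nTasks; t++) {
                final int first = t + 1;
                pending.add(pool.submit(() -> {
                    long s = 0;
                    for (int i = first; i <= n; i += nTasks) {
                        s += i;             // task t handles t+1, t+1+nTasks, ...
                    }
                    total.addAndGet(s);     // atomic update, no explicit lock
                }));
            }
            for (Future<?> f : pending) {
                f.get();                    // wait for each task to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return total.get();
    }
}
```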


Java Memory Model

• Most programmers did not read Chapter 17 of the Java Language Specification.

• Those who did read it did not fully understand it.

• Lots of code has been written that is not portable.


For example,

• The Java Grande Forum distributes multithreaded Java benchmarks.

• These benchmarks utilize a barrier method implemented with volatile variables and “busy waiting”.

• However, the benchmarks assume that when a volatile variable is set, all of memory will also be made consistent. Not true!
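The pattern at issue can be sketched as a plain write followed by a volatile flag that another thread busy-waits on. FlagHandoff is a hypothetical class and is not the Java Grande Forum code. Under the memory model the talk describes, nothing tied the plain write to the volatile write; the revised model that came out of JSR 133 (Java 5 and later) does order them, which is why this sketch behaves predictably on current JVMs.

```java
// Illustrative only: FlagHandoff is hypothetical and is not the benchmark code.
class FlagHandoff {
    private int data;                 // plain (non-volatile) field
    private volatile boolean ready;   // volatile flag used for busy waiting

    static int run() {
        FlagHandoff h = new FlagHandoff();
        Thread writer = new Thread(() -> {
            h.data = 42;              // plain write
            h.ready = true;           // volatile write
        });
        writer.start();
        while (!h.ready) {            // busy wait on the volatile flag
            Thread.yield();
        }
        // Original memory model: this read was not required to observe 42.
        // Revised model (JSR 133): the volatile write/read pair orders the
        // plain write before this read.
        return h.data;
    }
}
```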


Implementors also struggled…

• In June 2000, IBM researchers suggested my cluster JVM violated the JMM, but could not cite an example.

• In July 2000, I produced a “proof” of correctness.

• In June 2001, a counter-example was found.

• Problem concerns properly handling “improperly synchronized” programs.


Java Specification Request 133

• Expert Group formed 06/12/01.

• Goal is to re-specify the Java memory model:
  – Maintain relaxed consistency.
  – Loosen implementation requirements for handling “improperly synchronized” programs.
  – Fix ambiguities and holes.

• Current draft is still “rough sledding”!


Cluster Implementation of Java

• Single JVM running on a cluster of machines.

• Nodes of the cluster are transparent.

• Multithreaded applications exploit multiple processors of cluster.


Hyperion

• Cluster implementation of Java developed at the University of New Hampshire.

• Currently built on top of the PM2 distributed, multithreaded runtime environment from ENS-Lyon.


General Hyperion Overview

[Diagram: compilation pipeline]

prog.java → javac (Sun's Java compiler) → prog.class (bytecode) → java2c (instruction-wise translation) → prog.[ch] → gcc -O6, linked with the runtime libraries (libs) → prog


The Hyperion Run-Time System

• Collection of modules to allow “plug-and-play” implementations:
  – inter-node communication
  – threads
  – memory and synchronization
  – etc.


Thread and Object Allocation

• Currently, threads are allocated to processors in round-robin fashion.

• Currently, an object is allocated to the processor that holds the thread that is creating the object.

• Currently, DSM-PM2 is used to implement the Java memory model.
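The two placement policies above can be sketched as follows. NodePlacement is a hypothetical class written for illustration and is not Hyperion's code; starting the rotation at node 0 is an assumption of the sketch.

```java
// Illustrative only: NodePlacement is hypothetical, not Hyperion's code.
class NodePlacement {
    private final int nNodes;
    private int nextNode;   // assumption: the first thread goes to node 0

    NodePlacement(int nNodes) {
        this.nNodes = nNodes;
    }

    // Threads are assigned to nodes in round-robin order.
    synchronized int nodeForNewThread() {
        int node = nextNode;
        nextNode = (nextNode + 1) % nNodes;
        return node;
    }

    // An object is placed on the node that runs its creating thread.
    int nodeForNewObject(int creatingThreadNode) {
        return creatingThreadNode;
    }
}
```

With three nodes, five successive threads would land on nodes 0, 1, 2, 0, 1, and every object a thread creates would live on that thread's node.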


Hyperion Internal Structure

[Diagram: layered structure, top to bottom]

Hyperion: load balancer, native Java API, thread subsystem, memory subsystem, comm. subsystem

PM2 API: pm2_rpc, pm2_thread_create, etc.

PM2: DSM subsystem, thread subsystem, comm. subsystem


PM2: A Distributed, Multithreaded Runtime Environment

• Thread library: Marcel
  – User-level
  – Supports SMP
  – POSIX-like
  – Preemptive thread migration

• Communication library: Madeleine
  – Portable: BIP, SISCI/SCI, MPI, TCP, PVM
  – Efficient

             Context Switch   Create
SMP          0.250 µs         2 µs
Non-SMP      0.120 µs         0.55 µs

             Latency   Bandwidth
SCI/SISCI    6 µs      70 MB/s
BIP/Myrinet  8 µs      125 MB/s

             Thread Migration
SCI/SISCI    24 µs
BIP/Myrinet  75 µs


DSM-PM2: Architecture

• DSM comm:
  – send page request
  – send page
  – send invalidate request
  – …

• DSM page manager:
  – set/get page owner
  – set/get page access
  – add/remove to/from copyset
  – ...

[Diagram: DSM-PM2 layers, top to bottom]

DSM-PM2: DSM Protocol Policy, DSM Protocol lib, DSM Page Manager, DSM Comm

PM2: Madeleine (comms), Marcel (threads)


DSM Implementation

• Node-level caches.

• Page-based and home-based protocol.

• Use page faults to detect remote objects.

• Log modifications made to remote objects.

• Each node allocates objects from a different range of the virtual address space.
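One consequence of giving each node its own address range is that the home node of an object can be computed from its address alone. The sketch below illustrates that arithmetic only; every constant and name in it (PAGE_SIZE, HEAP_BASE, BYTES_PER_NODE, AddressMap) is hypothetical and does not reflect DSM-PM2's actual layout.

```java
// Illustrative only: all constants and names are hypothetical.
class AddressMap {
    static final long PAGE_SIZE = 4096;
    static final long HEAP_BASE = 0x4000_0000L;
    static final long BYTES_PER_NODE = 64L * 1024 * 1024;

    // Start of the range from which the given node allocates objects.
    static long rangeStart(int node) {
        return HEAP_BASE + node * BYTES_PER_NODE;
    }

    // Home node of an address: the node whose allocation range contains it.
    static int homeNode(long address) {
        return (int) ((address - HEAP_BASE) / BYTES_PER_NODE);
    }

    // Page that contains the address; pages are the unit of transfer.
    static long pageOf(long address) {
        return address / PAGE_SIZE;
    }

    // An access is remote when the address belongs to another node's range.
    static boolean isRemote(long address, int thisNode) {
        return homeNode(address) != thisNode;
    }
}
```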


Benchmarking

• Two Linux 2.2 clusters:
  – twelve 200 MHz Pentium Pro processors connected by Myrinet switch and using BIP.
  – six 450 MHz Pentium II processors connected by a SCI network and using SISCI.

• gcc 2.7.2.3 with -O6


Pi (50M intervals)

[Chart: execution time in seconds versus number of nodes (1, 2, 4, 6, 8, 10, 12); series: 200MHz/BIP and 450MHz/SCI]


Jacobi (1024x1024)

[Chart: execution time in seconds versus number of nodes (1, 2, 4, 6, 8, 10, 12); series: 200MHz/BIP and 450MHz/SCI]


Traveling Salesperson (17 cities)

[Chart: execution time in seconds versus number of nodes (1, 2, 4, 6, 8, 10, 12); series: 200MHz/BIP and 450MHz/SCI]


All-pairs Shortest Path (2K nodes)

[Chart: execution time in seconds versus number of nodes (1, 2, 4, 6, 8, 10, 12); series: 200MHz/BIP and 450MHz/SCI]


Barnes-Hut (16K bodies)

[Chart: execution time in seconds versus number of nodes (1, 2, 4, 6, 8, 10, 12); series: 200MHz/BIP and 450MHz/SCI]


Current Work

• Comparing Hyperion to mpiJava.

• mpiJava is a set of JNI wrappers to MPI.

• Using Java Grande Forum benchmarks.

• mpiJava implemented on top of single-node version of Hyperion.

• This controls for quality of bytecode implementation.


Problems with JGF Benchmarks

• Written with SMP hardware in mind.
  – bogus synchronization.
  – all data allocated by one thread.

• SMP is not the right model!


An Alternative Model

• Programmer should be aware of the memory hierarchy.

• Do not require “magic” implementation.

• The thread is the correct level of abstraction:
  – If an object was created by a thread, then the object is “near” the thread.
  – Otherwise the object might be “far” from the thread.
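In code, this model suggests that each worker allocate the data it will work on, instead of having one thread allocate everything up front. A minimal sketch, with hypothetical names (NearAllocation, sumOfBlocks): on an ordinary single-node JVM the two styles behave alike, and the difference is expected to matter only where object placement follows the creating thread, as in Hyperion.

```java
// Illustrative only: NearAllocation is a hypothetical class.
class NearAllocation {
    static double sumOfBlocks(int nThreads, int blockSize) {
        // Small result array created by the parent; it may be "far" from the workers.
        double[] partial = new double[nThreads];
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                // Created by the worker itself, so the block is "near" this thread.
                double[] mine = new double[blockSize];
                for (int i = 0; i < mine.length; i++) {
                    mine[i] = 1.0;
                }
                double s = 0.0;
                for (double v : mine) {
                    s += v;
                }
                partial[id] = s;    // one write to the shared result per worker
            });
            workers[t].start();
        }
        double total = 0.0;
        for (int t = 0; t < nThreads; t++) {
            try {
                workers[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
            total += partial[t];
        }
        return total;
    }
}
```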


Efficiency and Portability

• Will not hurt on SMP hardware and may even help.

• Implementation can be straightforward.

• “Magic” implementations also possible.

• Encourages portability across different implementations and hardware.


Other Lessons Learned

• in-line checks vs. page faults

• network reactivity

• System.arraycopy


In-line Checks vs. Page Faults

• Earlier version of Hyperion used in-line checks to detect remote objects.

• For our benchmarks, using page faults was always better.

• Local accesses are free.

• Remote accesses are more expensive.

• But most accesses are local!


Network Reactivity

• Fetch of remote object implemented by asynchronous message to home node.

• Message handled by service thread on home node.

• When message arrives, service thread needs to be scheduled.

• Need integration of network layer and thread scheduler.


Short-term Solution

• Over-synchronize:
  – use BSP programming style
  – distinct phases for communication and computation
  – phases separated by barrier synchronization
  – so only service thread ready to run during communication phase
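The barrier-separated phase structure can be sketched with a one-dimensional Jacobi relaxation, assuming java.util.concurrent.CyclicBarrier is available. BspJacobi and relax are hypothetical names. Each thread updates its own slice, then all threads meet at the barrier before the next step; the barrier action swaps the current and next arrays once per step.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Illustrative only: BspJacobi is a hypothetical class.
class BspJacobi {
    // Runs `steps` relaxation steps over the interior cells; end cells stay fixed.
    static double[] relax(double[] initial, int steps, int nThreads) {
        // grid[0] is the current array, grid[1] is the one being written.
        double[][] grid = { initial.clone(), initial.clone() };
        // The barrier action runs once per step, after all threads have arrived.
        CyclicBarrier barrier = new CyclicBarrier(nThreads, () -> {
            double[] tmp = grid[0];
            grid[0] = grid[1];
            grid[1] = tmp;
        });
        int interior = initial.length - 2;
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            int lo = 1 + t * interior / nThreads;
            int hi = 1 + (t + 1) * interior / nThreads;
            workers[t] = new Thread(() -> {
                try {
                    for (int s = 0; s < steps; s++) {
                        double[] cur = grid[0];
                        double[] next = grid[1];
                        for (int i = lo; i < hi; i++) {
                            next[i] = 0.5 * (cur[i - 1] + cur[i + 1]);
                        }
                        barrier.await();    // end of phase: nobody starts the next step early
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return grid[0];
    }
}
```

The reads of cur[lo - 1] and cur[hi] touch cells owned by neighbouring threads; on a cluster these are the accesses that would involve other nodes.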


System.arraycopy

• Native implementation can transmit data in units that are bigger than a page.

• Requires in-line check but usually amortized over large amount of data.
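From the programmer's side, the difference is between copying element by element and handing the runtime the whole range in one call. A small sketch with hypothetical names (BulkCopy, copyLoop, copyBulk); both methods produce the same result, and only the bulk form gives a native implementation the chance to move data in large units.

```java
// Illustrative only: BulkCopy is a hypothetical class.
class BulkCopy {
    // Element-by-element copy: every access is a separate array reference.
    static void copyLoop(double[] src, double[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
    }

    // Bulk copy: a single call describes the whole range to the runtime.
    static void copyBulk(double[] src, double[] dst) {
        System.arraycopy(src, 0, dst, 0, src.length);
    }
}
```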


Conclusions

• Java threads are an attractive vehicle for parallel programming.

• Is Java serial execution fast enough?
  – Need true multi-dimensional arrays?

• Need clarified memory model.

• Need extended thread API.

• Programmers need to be aware of memory hierarchy.