Experience with a Cluster JVM Philip J. Hatcher University of New Hampshire [email protected].
2
Acknowledgements
• UNH students
  – Mark MacBeth and Keith McGuigan
• PM2 team
  – very effective and enjoyable collaboration
3
Traditional Parallel Programming
• Parallel programming supported by using a serial language plus a “bag on the side”.
  – e.g. Fortran plus MPI
• Parallel programming supported by extending a serial language.
  – e.g. High Performance Fortran
4
My History
• I spent years studying data-parallel extensions to C, such as C*.
  – Users never really accepted extensions.
  – They found them too complex.
  – They wanted standard, well-integrated solutions.
5
Java is a good thing!
• Java is explicitly parallel!
  – Language includes a threaded programming model.
• Java employs a relaxed memory model.
  – Consistency model aids an implementation on distributed-memory parallel computers.
6
Java Threads
• Threads are objects.
• The class java.lang.Thread contains all of the methods for initializing, running, suspending, querying and destroying threads.
7
java.lang.Thread methods
• Thread() - constructor for thread object.
• start() - start the thread executing.
• run() - method invoked by ‘start’.
• stop(), suspend(), resume(), join(), yield().
• setPriority().
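A minimal sketch of the Thread(), start(), run() and join() calls listed above. The class name ThreadBasics and the helper parallelSum are illustrative names, not part of the talk. The sketch avoids stop(), suspend() and resume(), which the JDK has marked deprecated.

```java
class ThreadBasics {
    // Hypothetical helper: sums 1..n using numThreads workers.
    static long parallelSum(int n, int numThreads) {
        final long[] partial = new long[numThreads];
        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            workers[t] = new Thread() {
                @Override
                public void run() {              // invoked as a result of start()
                    long sum = 0;
                    for (int i = id + 1; i <= n; i += numThreads) {
                        sum += i;
                    }
                    partial[id] = sum;           // each worker writes only its own slot
                }
            };
            workers[t].start();                  // begin executing run() in a new thread
        }
        long total = 0;
        for (int t = 0; t < numThreads; t++) {
            try {
                workers[t].join();               // wait for the worker to terminate
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
            total += partial[t];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1000, 4));
    }
}
```

Reading partial[t] only after join() returns is what makes the worker's write visible to the main thread without further locking.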
8
Java Synchronization
• Java uses monitors, which protect a region of code by allowing only one thread at a time to execute it.
• Monitors utilize locks.
• There is a lock associated with each object.
9
synchronized keyword
• synchronized ( Exp ) Block
• public class Q {
    synchronized void put(…) { … }
  }
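The two forms above, the synchronized statement and the synchronized method, can be sketched side by side. SyncCounter and runDemo are hypothetical names chosen for this illustration; the class Q from the slide is left as it was shown.

```java
class SyncCounter {
    private int count = 0;

    // Synchronized method: the lock of 'this' is held for the whole body.
    synchronized void increment() {
        count++;
    }

    // Synchronized statement: the same lock, named explicitly.
    int get() {
        synchronized (this) {
            return count;
        }
    }

    // Hypothetical driver: numThreads threads each increment perThread times.
    static int runDemo(int numThreads, int perThread) {
        SyncCounter counter = new SyncCounter();
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 10000));
    }
}
```

Without the synchronized keyword on increment(), concurrent updates could be lost, because count++ is a read followed by a write.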
10
java.lang.Object methods
• wait() - the calling thread, which must hold the lock for the object, is placed in a wait set associated with the object. The lock is then released.
• notify() - an arbitrary thread in the wait set of this object is awakened and then competes again to get the lock for the object.
• notifyAll() - all waiting threads are awakened.
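As an illustration of wait() and notifyAll(), here is a sketch of a one-element buffer shared by a producer and a consumer. Slot and transfer are hypothetical names. Each wait() sits in a loop that re-tests its condition, since a thread that is awakened still has to compete for the lock and may find the condition changed.

```java
class Slot {
    private Integer value = null;            // null means the slot is empty

    synchronized void put(int v) throws InterruptedException {
        while (value != null) {
            wait();                          // releases this object's lock while waiting
        }
        value = v;
        notifyAll();                         // wake any thread waiting in take()
    }

    synchronized int take() throws InterruptedException {
        while (value == null) {
            wait();
        }
        int v = value;
        value = null;
        notifyAll();                         // wake any thread waiting in put()
        return v;
    }

    // Hypothetical driver: a producer sends 1..n, the caller sums what it receives.
    static int transfer(int n) {
        Slot slot = new Slot();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) {
                    slot.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < n; i++) {
                sum += slot.take();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during transfer", e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(transfer(100));
    }
}
```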
11
Shared-Memory Model
• Java threads execute in a virtual shared memory.
• All threads are able to access all objects.
• But threads may not access each other’s stacks.
12
Java Memory Consistency
• A variant of release consistency.
• Threads can keep locally cached copies of objects.
• Consistency is provided by requiring that:
  – a thread's object cache be flushed upon entry to a monitor.
  – local modifications made to cached objects be transmitted to the central memory when a thread exits a monitor.
13
Problems with Java Threads
• Java support for threads is very low level.
• Java memory model is not very well understood.
14
Threads API
• No condition variables.
• No semaphores.
• No barriers.
• No collective operations on thread groups (e.g. sum reduction).
• No parallel collections.
15
So…
• Using low-level operations can be difficult and error-prone.
• Everyone is “re-inventing the wheel” as they struggle to construct higher level abstractions.
16
Java Specification Request 166
• Expert Group formed 01/23/02.
• Goal is to provide java.util.concurrent:
  – atomic variables
  – special-purpose locks, barriers, semaphores and condition variables
  – queues and related collections for multithreaded use
  – thread pools
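For orientation, a sketch of a sum reduction written against java.util.concurrent as it later shipped in the JDK, using a thread pool and futures. PoolReduction and its sum method are hypothetical names; the talk itself predates the final API and shows no such code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoolReduction {
    // Hypothetical helper: sums 1..n by splitting the range over numTasks pool tasks.
    static long sum(int n, int numTasks) {
        ExecutorService pool = Executors.newFixedThreadPool(numTasks);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int t = 0; t < numTasks; t++) {
                final int id = t;
                parts.add(pool.submit(() -> {
                    long s = 0;
                    for (int i = id + 1; i <= n; i += numTasks) {
                        s += i;
                    }
                    return s;                // partial sum for this task
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();         // blocks until that task has finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reducing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(1000, 4));
    }
}
```

Compared with raw threads, the pool and the Future objects take over thread lifetime and result hand-off, which is the kind of "re-invented wheel" the previous slide complains about.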
17
Java Memory Model
• Most programmers did not read Chapter 17 of the Java Language Specification.
• Those who did read it did not fully understand it.
• Lots of code has been written that is not portable.
18
For example,
• The Java Grande Forum distributes multithreaded Java benchmarks.
• These benchmarks utilize a barrier method implemented with volatile variables and “busy waiting”.
• However, the benchmarks assume that when a volatile variable is set, all of memory will also be made consistent. Not true!
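One way to avoid that assumption is a barrier built only on monitor entry and exit, the points at which the consistency rules on the earlier slide apply. The sketch below is an illustration under that reading, not the benchmark's code or a fix proposed in the talk; MonitorBarrier and demo are hypothetical names.

```java
class MonitorBarrier {
    private final int parties;
    private int waiting = 0;
    private int generation = 0;              // incremented each time the barrier trips

    MonitorBarrier(int parties) {
        this.parties = parties;
    }

    synchronized void await() throws InterruptedException {
        int myGeneration = generation;
        if (++waiting == parties) {
            waiting = 0;
            generation++;                    // last arrival releases everyone
            notifyAll();
        } else {
            while (generation == myGeneration) {
                wait();                      // no busy waiting; lock is released here
            }
        }
    }

    // Hypothetical driver: each thread writes one element, then all read every element.
    static long[] demo(int n) {
        MonitorBarrier barrier = new MonitorBarrier(n);
        int[] data = new int[n];
        long[] sums = new long[n];
        Thread[] threads = new Thread[n];
        for (int t = 0; t < n; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                data[id] = id + 1;           // phase 1: write own element
                try {
                    barrier.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                long s = 0;
                for (int v : data) {         // phase 2: read everyone's element
                    s += v;
                }
                sums[id] = s;
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo(4)));
    }
}
```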
19
Implementors also struggled…
• In June 2000, IBM researchers suggested my cluster JVM violated the JMM, but could not cite an example.
• In July 2000, I produced a “proof” of correctness.
• In June 2001, a counter-example was found.
• Problem concerns properly handling “improperly synchronized” programs.
20
Java Specification Request 133
• Expert Group formed 06/12/01.
• Goal is to re-specify the Java memory model:
  – Maintain relaxed consistency.
  – Loosen implementation requirements for handling “improperly synchronized” programs.
  – Fix ambiguities and holes.
• Current draft is still “rough sledding”!
21
Cluster Implementation of Java
• Single JVM running on a cluster of machines.
• Nodes of the cluster are transparent.
• Multithreaded applications exploit multiple processors of the cluster.
22
Hyperion
• Cluster implementation of Java developed at the University of New Hampshire.
• Currently built on top of the PM2 distributed, multithreaded runtime environment from ENS-Lyon.
23
General Hyperion Overview
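[Figure: compilation pipeline. prog.java is compiled by javac (Sun's Java compiler) to prog.class (bytecode); java2c performs an instruction-wise translation to prog.[ch]; gcc -O6 compiles the result and links it with the runtime libraries (libs) to produce prog.]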
24
The Hyperion Run-Time System
• Collection of modules to allow “plug-and-play” implementations:
  – inter-node communication
  – threads
  – memory and synchronization
  – etc.
25
Thread and Object Allocation
• Currently, threads are allocated to processors in round-robin fashion.
• Currently, an object is allocated to the processor that holds the thread that is creating the object.
• Currently, DSM-PM2 is used to implement the Java memory model.
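The first two placement rules can be sketched in a few lines. This is an illustration of the policy as stated on the slide, not Hyperion's code; Placement, placeThread and homeOfObject are hypothetical names, and nodes are assumed to be numbered from 0.

```java
class Placement {
    private final int numNodes;
    private int nextNode = 0;

    Placement(int numNodes) {
        this.numNodes = numNodes;
    }

    // Round-robin: the k-th thread created is placed on node k mod numNodes.
    synchronized int placeThread() {
        int node = nextNode;
        nextNode = (nextNode + 1) % numNodes;
        return node;
    }

    // An object's home is the node holding the thread that creates it.
    static int homeOfObject(int creatingThreadNode) {
        return creatingThreadNode;
    }

    public static void main(String[] args) {
        Placement placement = new Placement(3);
        for (int k = 0; k < 5; k++) {
            int node = placement.placeThread();
            System.out.println("thread " + k + " on node " + node
                    + ", its objects homed on node " + homeOfObject(node));
        }
    }
}
```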
26
Hyperion Internal Structure
[Figure: Hyperion modules (load balancer, native Java API, thread subsystem, memory subsystem, comm. subsystem) layered over the PM2 API (pm2_rpc, pm2_thread_create, etc.), which is provided by PM2 (DSM subsystem, thread subsystem, comm. subsystem).]
27
PM2: A Distributed, Multithreaded Runtime Environment
• Thread library: Marcel
  – User-level
  – Supports SMP
  – POSIX-like
  – Preemptive thread migration
• Communication library: Madeleine
  – Portable: BIP, SISCI/SCI, MPI, TCP, PVM
  – Efficient

Marcel thread operations:
              Context switch   Create
SMP           0.250 µs         2 µs
Non-SMP       0.120 µs         0.55 µs

Madeleine communication:
              Latency   Bandwidth
SCI/SISCI     6 µs      70 MB/s
BIP/Myrinet   8 µs      125 MB/s

Thread migration:
SCI/SISCI     24 µs
BIP/Myrinet   75 µs
28
DSM-PM2: Architecture
• DSM comm:
  – send page request
  – send page
  – send invalidate request
  – …
• DSM page manager:
  – set/get page owner
  – set/get page access
  – add/remove to/from copyset
  – ...

[Figure: DSM-PM2 comprises DSM Protocol Policy, DSM Protocol lib, DSM Page Manager and DSM Comm, built on PM2 (Madeleine comms, Marcel threads).]
29
DSM Implementation
• Node-level caches.
• Page-based and home-based protocol.
• Use page faults to detect remote objects.
• Log modifications made to remote objects.
• Each node allocates objects from a different range of the virtual address space.
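Giving each node its own address range means an object's home can be computed from its address alone. The sketch below shows that arithmetic under assumed parameters: equal-sized contiguous ranges starting at a common base. The slides do not give the actual layout, and AddressRanges with its fields and methods is a hypothetical construction.

```java
class AddressRanges {
    private final long base;        // assumed start of the shared region
    private final long rangeSize;   // assumed size of each node's range, in bytes
    private final int numNodes;

    AddressRanges(long base, long rangeSize, int numNodes) {
        this.base = base;
        this.rangeSize = rangeSize;
        this.numNodes = numNodes;
    }

    // First address of the range from which the given node allocates.
    long rangeStart(int node) {
        return base + node * rangeSize;
    }

    // Home node of an address: which node's range contains it.
    int homeNode(long address) {
        long offset = address - base;
        if (offset < 0 || offset >= numNodes * rangeSize) {
            throw new IllegalArgumentException("address outside the shared region");
        }
        return (int) (offset / rangeSize);
    }

    boolean isLocal(long address, int thisNode) {
        return homeNode(address) == thisNode;
    }

    public static void main(String[] args) {
        AddressRanges ranges = new AddressRanges(0x10000000L, 1L << 20, 4);
        long address = ranges.rangeStart(2) + 100;
        System.out.println("home node: " + ranges.homeNode(address));
    }
}
```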
30
Benchmarking
• Two Linux 2.2 clusters:
  – twelve 200 MHz Pentium Pro processors connected by a Myrinet switch and using BIP.
  – six 450 MHz Pentium II processors connected by an SCI network and using SISCI.
• gcc 2.7.2.3 with -O6
31
Pi (50M intervals)
[Chart: execution time in seconds (axis 0 to 12) versus number of nodes (1, 2, 4, 6, 8, 10, 12); series: 200MHz/BIP and 450MHz/SCI.]
32
Jacobi (1024x1024)
[Chart: execution time in seconds (axis 0 to 100) versus number of nodes (1, 2, 4, 6, 8, 10, 12); series: 200MHz/BIP and 450MHz/SCI.]
33
Traveling Salesperson (17 cities)
[Chart: execution time in seconds (axis 0 to 1400) versus number of nodes (1, 2, 4, 6, 8, 10, 12); series: 200MHz/BIP and 450MHz/SCI.]
34
All-pairs Shortest Path (2K nodes)
[Chart: execution time in seconds (axis 0 to 1000) versus number of nodes (1, 2, 4, 6, 8, 10, 12); series: 200MHz/BIP and 450MHz/SCI.]
35
Barnes-Hut (16K bodies)
[Chart: execution time in seconds (axis 0 to 140) versus number of nodes (1, 2, 4, 6, 8, 10, 12); series: 200MHz/BIP and 450MHz/SCI.]
36
Current Work
• Comparing Hyperion to mpiJava.
• mpiJava is a set of JNI wrappers to MPI.
• Using Java Grande Forum benchmarks.
• mpiJava implemented on top of single-node version of Hyperion.
• This controls for quality of bytecode implementation.
37
Problems with JGF Benchmarks
• Written with SMP hardware in mind.
  – bogus synchronization.
  – all data allocated by one thread.
• SMP is not the right model!
38
An Alternative Model
• Programmer should be aware of the memory hierarchy.
• Do not require “magic” implementation.
• The thread is the correct level of abstraction:
  – If an object was created by a thread, then the object is “near” the thread.
  – Otherwise the object might be “far” from the thread.
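In code, this model mostly changes where allocation happens: each worker creates the data it works on inside its own run() instead of receiving it from the main thread. The sketch below illustrates that style; NearAllocation and sumChunks are hypothetical names, and whether the arrays actually end up "near" depends on the JVM, as the slide says.

```java
class NearAllocation {
    // Hypothetical helper: each worker allocates and fills its own chunk, then sums it.
    static long sumChunks(int numThreads, int chunk) {
        long[] results = new long[numThreads];   // created by the caller: possibly "far"
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                // Allocated by the worker itself, so "near" it under the proposed model.
                long[] local = new long[chunk];
                for (int i = 0; i < chunk; i++) {
                    local[i] = (long) id * chunk + i;
                }
                long s = 0;
                for (long v : local) {
                    s += v;
                }
                results[id] = s;                 // a single write to shared data
            });
            threads[t].start();
        }
        long total = 0;
        for (int t = 0; t < numThreads; t++) {
            try {
                threads[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
            total += results[t];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumChunks(4, 1000));
    }
}
```

The contrast is with the benchmark pattern criticized earlier, where one thread allocates all data and every other thread then works on objects it did not create.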
39
Efficiency and Portability
• Will not hurt on SMP hardware and may even help.
• Implementation can be straightforward.
• “Magic” implementations also possible.
• Encourages portability across different implementations and hardware.
40
Other Lessons Learned
• in-line checks vs. page faults
• network reactivity
• System.arraycopy
41
In-line Checks vs. Page Faults
• Earlier version of Hyperion used in-line checks to detect remote objects.
• For our benchmarks, using page faults was always better.
• Local accesses are free.
• Remote accesses are more expensive.
• But most accesses are local!
42
Network Reactivity
• Fetch of remote object implemented by asynchronous message to home node.
• Message handled by service thread on home node.
• When message arrives, service thread needs to be scheduled.
• Need integration of network layer and thread scheduler.
43
Short-term Solution
• Over-synchronize:
  – use BSP programming style
  – distinct phases for communication and computation
  – phases separated by barrier synchronization
  – so only service thread ready to run during communication phase
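A sketch of the BSP style described above, using java.util.concurrent.CyclicBarrier from later JDKs as the barrier. BspRing and run are hypothetical names and the ring update is a toy computation. Each step has a communication phase that only reads neighbours' cells and a computation phase that only writes the thread's own cell, with a barrier after each.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

class BspRing {
    // Hypothetical helper: n threads, one cell each, arranged in a ring.
    static int[] run(int n, int steps) {
        int[] cell = new int[n];
        for (int i = 0; i < n; i++) {
            cell[i] = i;
        }
        CyclicBarrier barrier = new CyclicBarrier(n);
        Thread[] threads = new Thread[n];
        for (int t = 0; t < n; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                try {
                    for (int s = 0; s < steps; s++) {
                        // Communication phase: fetch the neighbours' values.
                        int left = cell[(id + n - 1) % n];
                        int right = cell[(id + 1) % n];
                        barrier.await();
                        // Computation phase: update only the cell this thread owns.
                        cell[id] = left + right;
                        barrier.await();
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return cell;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(4, 2)));
    }
}
```

Because no thread writes during the communication phase and no thread reads a neighbour during the computation phase, the result does not depend on scheduling.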
44
System.arraycopy
• Native implementation can transmit data in units that are bigger than a page.
• Requires in-line check but usually amortized over large amount of data.
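From the application's side the difference is between an element-by-element loop and a single bulk call. The sketch below shows both; BulkCopy and its methods are hypothetical names. On an ordinary JVM the two produce the same result. The point on the slide is that a cluster JVM's native arraycopy sees the whole range at once and so has the opportunity to move it in large units.

```java
class BulkCopy {
    // Element-wise copy: one array access per element.
    static int[] copyElementwise(int[] src) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
        return dst;
    }

    // Bulk copy: a single call that hands the whole range to the runtime.
    static int[] copyBulk(int[] src) {
        int[] dst = new int[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }

    public static void main(String[] args) {
        int[] src = new int[1 << 16];
        for (int i = 0; i < src.length; i++) {
            src[i] = i;
        }
        System.out.println(java.util.Arrays.equals(copyElementwise(src), copyBulk(src)));
    }
}
```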
45
Conclusions
• Java threads are an attractive vehicle for parallel programming.
• Is Java serial execution fast enough?
  – Need true multi-dimensional arrays?
• Need clarified memory model.
• Need extended thread API.
• Programmers need to be aware of memory hierarchy.