Post on 04-Jan-2016
Distributed Galois
Andrew Lenharth, 2/27/2015
Goals
• An implementation of the operator formulation for distributed memory
  – Ideally forward-compatible where possible
• Both a simple programming model and a fast implementation
  – Like Galois, may need restrictions or structure for highest performance
Overview
• PGAS (using fat pointers)
• Implicit, asynchronous communication
• Default execution mode:
  – Galois compatible
  – Implicit locking and data movement
  – Pluggable schedulers
  – Speculative execution
• All D-Galois programs are valid Galois programs
Galois Implementation

[Diagram: the Galois implementation stack. User Code sits atop User Context, Graph, and Parallel Loop; beneath those are Contention Manager, Memory Management, and Statistics; a Support layer provides Topology, Scheduler, Barrier, Termination, etc.]
Distributed Galois Implementation

[Diagram: the same stack extended for distributed memory. User Code, User Context, Graph, Parallel Loop, Contention Manager, Memory Management, Statistics, and the Support layer (Topology, Scheduler, Barrier, Termination, etc.) are joined by new components: Network, Directory, and Remote Store.]
Current Status
• Working implementation of baseline
  – Asynchronous, speculative
Interesting Problems
• Livelock
• Asynchronous directory
• Abstractions for building data structures
• Network hardware
• Network software
• Remote updates
• Scheduling
Solved: Livelock
• Source: object state transitions are more complex, asynchronous, and may require multiple steps (hence interruptible)
• Solution: scheme to ensure forward progress of one host
• Alternate: if this happens a lot for your application, coordinated scheduling may be more appropriate (or relaxed consistency)
Asynchronous Directory
• Source: communication and workers interleave access to directory (and directly to objects stored in the directory)
• Solution: mostly just a pain.
Abstraction for building DS
• Source: Distributed data structures are hard (so are shared-memory data structures).
• Solution: a set of abstractions
  – Federated object: a different instance on each host/thread; pointers resolve locally.
  – Federation is bootstrapped by the runtime.
  – Federated objects don’t have any notion of exclusive behavior.
Remote Updates
• Directory synchronization is really bad when not needed (and essential when needed)
• Many algorithms have an update-and-schedule behavior for their neighbors
• Treat this behavior as a task type
  – Multiple task types per loop
  – Quite similar to nested parallelism
Remote Updates – PageRank
Original operator:
  self.value += self.residual
  for n : neighbors
    n.residual += f(self.residual)
    schedule (operator type on) {n}

With update tasks:
  self.value += self.residual
  for n : neighbors
    schedule (update type on) {n, f(self.residual)}

With a new operator:
  self.residual += update
  schedule (operator type on) {self}
Scheduling
• Source: Imagine SSSP using the existing (host-unaware) schedulers on distributed memory
• Need a scheduler with a way to anchor work to a data-structure element
Network hardware
Networks
• Small asynchronous messages are bad for throughput
• Scale-free graphs stress throughput
• Large messages are bad for latency
• Find the optimal point
  – Sometimes latency is critical
Nagle’s algorithm
• If you don’t have a large message, wait a while to get more data
• Bad for latency
• Also, keeps MPI in its broken behavior range
• Also, requires O(P) memory for communications (assuming direct pointwise)
Communication pattern
Software Routing
• Pros: single communication channel
  – Scales with hosts
  – Aggregates all messages
• Cons: 2 hops (or more)