Message Passing On Tightly-Interconnected Multi-Core Processors
James Psota and Anant Agarwal, MIT CSAIL
Technology Scaling Enables Multi-Cores
• Multi-cores offer a novel environment for parallel computing
[Images: a cluster; a multi-core processor]
Traditional Communication On Multi-Processors
• Shared memory
– Shared caches or memory
– Remote DMA (RDMA)
• Interconnects
– Ethernet TCP/IP
– Myrinet
– Scalable Coherent Interconnect (SCI)
[Images: AMD Dual-Core Opteron; Beowulf Cluster]
On-Chip Networks Enable Fast Communication
• Some multi-cores offer…
– tightly integrated on-chip networks
– direct access to hardware resources (no OS layers)
– fast interrupts
• MIT Raw Processor used for experimentation and validation
Parallel Programming is Hard
• Must orchestrate computation and communication
• Extra resources present both opportunity and challenge
• Trivial to deadlock (see the sketch below)
• Constraints on message sizes
• No operating system support
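To make the deadlock pitfall concrete, here is a hedged illustration (not from the talk) using standard MPI calls: if both ranks send first on a network with minimal buffering, like Raw's GDN, neither blocking send can complete.

```c
/* Hypothetical illustration: head-to-head blocking sends can deadlock
 * on a minimally buffered network. Not code from the talk. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, peer, buf[1024], tmp[1024];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;  /* assumes exactly two ranks */

    /* Both ranks send first: if the implementation cannot buffer the
     * message, each MPI_Send blocks until the peer posts a receive,
     * which never happens. */
    MPI_Send(buf, 1024, MPI_INT, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(tmp, 1024, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Fix: order the calls (one rank receives first) or use MPI_Sendrecv. */
    MPI_Finalize();
    return 0;
}
```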
rMPI’s Approach
• Goals
– robust, deadlock-free, scalable programming interface
– easy to program through high-level routines
• Challenge
– exploit hardware resources for efficient communication
– don't sacrifice performance
Outline
• Introduction
• Background
• Design
• Results
• Related Work
The Raw Multi-Core Processor
• 16 identical tiles
– processing core
– network routers
• 4 register-mapped on-chip networks
• Direct access to hardware resources
• Hardware fabricated in ASIC process
[Image: the Raw processor]
Raw’s General Dynamic Network
• Handles run-time events
– interrupts, dynamic messages
• Network guarantees atomic, in-order messages
• Dimension-ordered wormhole routed
• Maximum message length: 31 words
• Blocking sends/receives
• Minimal network buffering
MPI: Portable Message Passing API
• Gives programmers high-level abstractions for parallel programming
– send/receive, scatter/gather, reductions, etc.
• MPI is a standard, not an implementation
– many implementations for many HW platforms
– over 200 API functions
• MPI applications are portable across MPI-compliant systems (see the minimal example below)
• Can impose high overhead
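For orientation, here is a minimal MPI program using only standard calls; the same source compiles against any compliant implementation. This example is illustrative and not taken from the talk.

```c
/* Minimal sketch of the portable MPI API (standard MPI calls,
 * not specific to rMPI). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */
    printf("process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```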
MPI Semantics: Cooperative Communication
• Data exchanged cooperatively via explicit send and receive
• Receiving process's memory is modified only with its explicit participation
• Combines communication and synchronization (see the sketch below)
[Diagram: processes 0 and 1, each with a private address space, exchange messages over a communication channel via send(dest, tag) and recv(src, tag) with tags 17 and 42; arriving messages trigger interrupts on the receiver]
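A hedged sketch of the send/receive pairing the diagram depicts, using standard MPI point-to-point calls (illustrative, not code from the talk):

```c
/* Cooperative communication between two processes: the receiver's
 * memory is written only because it explicitly asked to receive. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, data = 42, temp;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Sender names the destination and tags the message. */
        MPI_Send(&data, 1, MPI_INT, /*dest=*/1, /*tag=*/17, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive matches on (source, tag); communication and
         * synchronization happen together. */
        MPI_Recv(&temp, 1, MPI_INT, /*src=*/0, /*tag=*/17, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```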
Outline
• Introduction
• Background
• Design
• Results
• Related Work
rMPI System Architecture
High-Level MPI Layer
• Argument checking (MPI semantics)
• Buffer preparation
• Calls appropriate low-level functions
• LAM/MPI partially ported
Collective Communications Layer
• Algorithms for collective operations
– Broadcast
– Scatter/Gather
– Reduce
• Invokes low-level functions (a broadcast sketch follows below)
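As a hedged sketch of how a collective layer can sit on top of point-to-point operations, a linear broadcast is shown below; rMPI's actual algorithms are not given in the transcript and are likely more sophisticated (e.g., tree-based), so this only illustrates the layering.

```c
/* Hedged sketch: a collective (broadcast) built from point-to-point
 * sends. Not rMPI's actual algorithm. */
#include <mpi.h>

void linear_bcast(int *buf, int count, int root, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == root) {
        for (int p = 0; p < size; p++)   /* root sends to every peer */
            if (p != root)
                MPI_Send(buf, count, MPI_INT, p, 0, comm);
    } else {
        MPI_Recv(buf, count, MPI_INT, root, 0, comm, MPI_STATUS_IGNORE);
    }
}
```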
Point-to-Point Layer
• Low-level send/receive routines
• Highly optimized interrupt-driven receive design (see the sketch below)
• Packetization and reassembly
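Below is a minimal sketch of an interrupt-driven receive path, assuming a hypothetical handler signature and buffer layout; it models the idea (drain the network on interrupt, buffer until the user-level receive) rather than rMPI's actual code.

```c
/* Hedged model of interrupt-driven receive, not rMPI source. */
#include <string.h>

#define MAX_WORDS 4096

static int rx_buffer[MAX_WORDS];   /* packets buffered until user recv */
static int rx_count = 0;

/* Invoked when GDN data arrives (hypothetical signature). Draining
 * the network promptly keeps senders from blocking. */
void gdn_interrupt_handler(const int *packet, int words) {
    memcpy(&rx_buffer[rx_count], packet, words * sizeof(int));
    rx_count += words;
}

/* User-level receive: if the message already arrived, hand the
 * buffered words to the caller without touching the network. */
int user_recv(int *dest, int words) {
    if (rx_count < words)
        return -1;   /* not yet arrived; real code would block or spin */
    memcpy(dest, rx_buffer, words * sizeof(int));
    rx_count = 0;
    return words;
}
```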
Outline
• Introduction
• Background
• Design
• Results
• Related Work
rMPI Evaluation
• How much overhead does the high-level interface impose?
– compare against hand-coded GDN
• Does it scale with problem size and number of processors?
– compare against hand-coded GDN
– compare against a commercial MPI implementation on a cluster
End-to-End Latency Overhead vs. Hand-Coded (1)
• Experiment measures latency for:
– sender: load message from memory
– sender: break up and send message
– receiver: receive message
– receiver: store message to memory
End-to-End Latency Overhead vs. Hand-Coded (2)
[Graph: rMPI latency overhead vs. hand-coded GDN: 481% for 1-word messages, falling to 33% at 1000 words; for large messages, packet management complexity overflows the cache]
Performance Scaling: Jacobi
[Graphs: Jacobi scaling for a 16x16 input matrix and a 2048x2048 input matrix]
Performance Scaling: Jacobi, 16 processors
[Graph annotation: the sequential version suffers cache capacity overflow at large problem sizes]
Overhead: Jacobi, rMPI vs. Hand-Coded
[Graph annotations: many small messages; 16 tiles: 5% overhead; memory access synchronization]
Matrix Multiplication: rMPI vs. LAM/MPI
[Graph annotation: many smaller messages; smaller message length has less effect on LAM]
Trapezoidal Integration: rMPI vs. LAM/MPI
Pi Estimation: rMPI vs. LAM/MPI
Related Work
• Low-latency communication networks
– iWarp, Alewife, INMOS
• Multi-core processors
– VIRAM, Wavescalar, TRIPS, POWER4, Pentium D
• Alternative approaches to programming Raw
– scalar operand network, CFlow, rawcc
• MPI implementations
– OpenMPI, LAM/MPI, MPICH
Summary
• rMPI provides an easy yet powerful programming model for multi-cores
• Scales better than a commercial MPI implementation
• Low overhead relative to hand-coded applications
Thanks!
For more information, see Master’s Thesis:
http://cag.lcs.mit.edu/~jim/publications/ms.pdf
rMPI messages broken into packets
• Receiver buffers and demultiplexes packets from different sources
• Messages received upon interrupt, and buffered until user-level receive
[Diagram: two rMPI sender processes interleave numbered packets toward an rMPI receiver process, which demultiplexes them on interrupt]
• GDN messages have a max length of 31 words
• rMPI packet format for a 65-payload-word MPI message (see the worked sketch below)
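A hedged sketch of the packetization arithmetic: the GDN's 31-word limit must cover both header and payload, so the per-packet header size (assumed to be 2 words here; the real rMPI layout is not given in this transcript) determines how many packets a 65-word message needs.

```c
/* Hedged sketch of sender-side packetization under the GDN limit.
 * HEADER_WORDS is an assumption for illustration only. */
#include <stdio.h>

#define GDN_MAX_WORDS 31
#define HEADER_WORDS  2   /* assumed: e.g., source/tag + sequencing info */
#define PAYLOAD_PER_PACKET (GDN_MAX_WORDS - HEADER_WORDS)

int packets_needed(int payload_words) {
    /* ceiling division: the last packet may be partially full */
    return (payload_words + PAYLOAD_PER_PACKET - 1) / PAYLOAD_PER_PACKET;
}

int main(void) {
    /* Under these assumptions, a 65-word MPI message needs
     * ceil(65 / 29) = 3 GDN packets. */
    printf("%d packets\n", packets_needed(65));
    return 0;
}
```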
rMPI: enabling MPI programs on Raw
rMPI…
• is compatible with current MPI software
• gives programmers already familiar with MPI an easy interface to program Raw
• gives programmers fine-grain control over their programs when automatic parallelization tools are not adequate
• gives users a robust, deadlock-free, and high-performance programming model with which to program Raw
► easily write programs on Raw without overly sacrificing performance
Packet boundary bookkeeping
• Receiver must handle packet interleaving across multiple interrupt handler invocations
Receive-side packet management
• Global data structures accessed by both the interrupt handler and MPI receive threads
• Data structure design minimizes pointer chasing for fast lookups (a hedged sketch follows below)
• No memcpy needed for the receive-before-send case
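A minimal sketch of the idea, assuming a direct-indexed table keyed by source rank; the real rMPI structures are not shown in the transcript.

```c
/* Hedged sketch (not rMPI's actual structures): index buffered
 * messages by source rank so the interrupt handler and the receive
 * path find a message's state in O(1), minimizing pointer chasing. */
#define MAX_RANKS 16          /* Raw has 16 tiles */

typedef struct {
    int  tag;                 /* MPI tag of the in-flight message */
    int  words_received;      /* reassembly progress across packets */
    int *payload;             /* buffered words awaiting user-level recv */
} inflight_msg;

/* One slot per sender: a packet's source rank indexes directly into
 * the table, so the handler never walks a list per packet. */
static inflight_msg inflight[MAX_RANKS];

inflight_msg *lookup(int src_rank) {
    return &inflight[src_rank];   /* constant time, no pointer chasing */
}
```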
[Diagrams: control-flow graphs (CFGs) for the user-thread receive path and the interrupt handler]
• Logic supports MPI semantics and packet construction
Future work: improving performance
• Compare rMPI against a standard cluster running an off-the-shelf MPI library
• Improve system performance
– further minimize MPI overhead
– spatially-aware collective communication algorithms
– further Raw-specific optimizations
• Investigate new APIs better suited for tiled processor architectures (TPAs)
Future work: HW extensions
• Simple hardware tweaks may significantly improve performance
– larger input/output FIFOs
– simple switch logic/demultiplexing to handle packetization could drastically simplify software logic
– larger header words (64-bit?) would allow much larger (atomic) packets (the current header only scales to 32 x 32 tile fabrics)
Conclusions
• The MPI standard was designed for "standard" parallel machines, not for tiled architectures
– MPI may no longer make sense for tiled designs
• Simple hardware could significantly reduce packet management overhead and increase rMPI performance