Message Passing On Tightly-Interconnected Multi-Core Processors


Transcript of Message Passing On Tightly-Interconnected Multi-Core Processors

Page 1: Message Passing On Tightly-Interconnected Multi-Core Processors

Message Passing On Tightly-Interconnected Multi-Core Processors

James Psota and Anant Agarwal, MIT CSAIL

Page 2: Message Passing On Tightly-Interconnected Multi-Core Processors

Technology Scaling Enables Multi-Cores

Multi-cores offer a novel environment for parallel computing

[Images: cluster vs. multi-core]

Page 3: Message Passing On Tightly-Interconnected Multi-Core Processors

Traditional Communication On Multi-Processors

• Shared Memory
 – shared caches or memory
 – Remote DMA (RDMA)

• Interconnects
 – Ethernet TCP/IP
 – Myrinet
 – Scalable Coherent Interconnect (SCI)

[Images: AMD Dual-Core Opteron; Beowulf cluster]

Page 4: Message Passing On Tightly-Interconnected Multi-Core Processors

On-Chip Networks Enable Fast Communication

• Some multi-cores offer…
 – tightly integrated on-chip networks
 – direct access to hardware resources (no OS layers)
 – fast interrupts

• MIT Raw Processor used for experimentation and validation

Page 5: Message Passing On Tightly-Interconnected Multi-Core Processors

Parallel Programming is Hard

• Must orchestrate computation and communication

• Extra resources present both opportunity and challenge

• Trivial to deadlock
• Constraints on message sizes
• No operating system support

Page 6: Message Passing On Tightly-Interconnected Multi-Core Processors

rMPI’s Approach

Goals
 – robust, deadlock-free, scalable programming interface
 – easy to program through high-level routines

Challenge
 – exploit hardware resources for efficient communication
 – don’t sacrifice performance

Page 7: Message Passing On Tightly-Interconnected Multi-Core Processors

Outline

• Introduction
• Background
• Design
• Results
• Related Work

Page 8: Message Passing On Tightly-Interconnected Multi-Core Processors

The Raw Multi-Core Processor

• 16 identical tiles
 – processing core
 – network routers

• 4 register-mapped on-chip networks

• Direct access to hardware resources

• Hardware fabricated in ASIC process

Raw Processor

Page 9: Message Passing On Tightly-Interconnected Multi-Core Processors

Raw’s General Dynamic Network

• Handles run-time events
 – interrupts, dynamic messages

• Network guarantees atomic, in-order messages

• Dimension-ordered wormhole routed
• Maximum message length: 31 words
• Blocking sends/receives
• Minimal network buffering

Page 10: Message Passing On Tightly-Interconnected Multi-Core Processors

MPI: Portable Message Passing API

• Gives programmers high-level abstractions for parallel programming
 – send/receive, scatter/gather, reductions, etc.

• MPI is a standard, not an implementation
 – many implementations for many HW platforms
 – over 200 API functions

• MPI applications portable across MPI-compliant systems

• Can impose high overhead
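To make that abstraction level concrete, the short C program below (generic MPI code, not taken from the talk) uses two of the high-level routines named above, MPI_Scatter and MPI_Gather, to distribute work across ranks and collect the results:

```c
/* Generic illustration of MPI's high-level collectives; not code from
 * the talk.  The root scatters equal chunks of an array, every rank
 * squares its chunk, and the results are gathered back at the root. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, size, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int per_rank = 4;
    int local[4], *all = NULL;

    if (rank == 0) {                       /* root prepares size*per_rank ints */
        all = malloc(size * per_rank * sizeof(int));
        for (i = 0; i < size * per_rank; i++) all[i] = i;
    }

    MPI_Scatter(all, per_rank, MPI_INT, local, per_rank, MPI_INT, 0, MPI_COMM_WORLD);
    for (i = 0; i < per_rank; i++) local[i] *= local[i];
    MPI_Gather(local, per_rank, MPI_INT, all, per_rank, MPI_INT, 0, MPI_COMM_WORLD);

    free(all);                             /* no-op on non-root ranks */
    MPI_Finalize();
    return 0;
}
```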

Page 11: Message Passing On Tightly-Interconnected Multi-Core Processors

MPI Semantics: Cooperative Communication

• Data exchanged cooperatively via explicit send and receive

• Receiving process’s memory only modified with its explicit participation

• Combines communication and synchronization

[Diagram: process 0 and process 1 each have a private address space joined by a communication channel. Process 0 issues send(dest=1, tag=17) and send(dest=1, tag=42); process 1 issues recv(src=0, tag=42) and recv(src=0, tag=17). Each arriving message raises an interrupt, and the tag=17 message is held in a temp buffer until its matching receive is posted.]
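The exchange in the diagram maps directly onto MPI calls. In the sketch below (generic MPI code, not from the talk), rank 0 sends messages with tags 17 and 42 while rank 1 posts its receives in the opposite order; matching is by (source, tag), so a message that arrives before its receive is posted is buffered by the library, the role played by "temp" above. Nonblocking sends keep the sketch deadlock-free regardless of receive order.

```c
/* Two-process exchange mirroring the diagram above (generic MPI code). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, a = 0, b = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int x = 17, y = 42;
        MPI_Request req[2];
        /* Start both sends before waiting so completion cannot depend on
           the order in which rank 1 posts its receives. */
        MPI_Isend(&x, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(&y, 1, MPI_INT, 1, 42, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        /* Receives posted in the opposite order of the sends; MPI matches
           by tag, buffering any message that arrives before its receive. */
        MPI_Recv(&b, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&a, 1, MPI_INT, 0, 17, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("tag 17 -> %d, tag 42 -> %d\n", a, b);
    }

    MPI_Finalize();
    return 0;
}
```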

Page 12: Message Passing On Tightly-Interconnected Multi-Core Processors

Outline

• Introduction
• Background
• Design
• Results
• Related Work

Page 13: Message Passing On Tightly-Interconnected Multi-Core Processors

rMPI System Architecture

Page 14: Message Passing On Tightly-Interconnected Multi-Core Processors

High-Level MPI Layer

• Argument checking (MPI semantics)

• Buffer prep

• Calls appropriate low level functions

• LAM/MPI partially ported

Page 15: Message Passing On Tightly-Interconnected Multi-Core Processors

Collective Communications Layer

• Algorithms for collective operations
 – Broadcast
 – Scatter/Gather
 – Reduce

• Invokes low level functions
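For intuition about how a collective can be layered on the point-to-point layer below it, here is a minimal binomial-tree broadcast written against plain blocking send/receive. This is a generic sketch, not rMPI's actual collective algorithm, and the tag value is arbitrary.

```c
/* Minimal binomial-tree broadcast layered on blocking point-to-point
 * calls.  Generic illustration only -- not rMPI's collective algorithm. */
#include <mpi.h>

#define BCAST_TAG 99   /* arbitrary tag for this sketch */

void tree_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm) {
    int rank, size, mask = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int vrank = (rank - root + size) % size;   /* rank relative to the root */

    /* Receive once from the parent (every process except the root). */
    while (mask < size) {
        if (vrank & mask) {
            int parent = (vrank - mask + root) % size;
            MPI_Recv(buf, count, type, parent, BCAST_TAG, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Forward to children at successively smaller strides. */
    mask >>= 1;
    while (mask > 0) {
        if (vrank + mask < size) {
            int child = (vrank + mask + root) % size;
            MPI_Send(buf, count, type, child, BCAST_TAG, comm);
        }
        mask >>= 1;
    }
}
```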

Page 16: Message Passing On Tightly-Interconnected Multi-Core Processors

Point-to-Point Layer

• Low-level send/receive routines

• Highly optimized interrupt-driven receive design

• Packetization and reassembly

Page 17: Message Passing On Tightly-Interconnected Multi-Core Processors

Outline

• Introduction• Background• Design• Results• Related Work

Page 18: Message Passing On Tightly-Interconnected Multi-Core Processors

rMPI Evaluation

• How much overhead does the high-level interface impose?
 – compare against hand-coded GDN

• Does it scale?
 – with problem size and number of processors?
 – compare against hand-coded GDN
 – compare against commercial MPI implementation on cluster

Page 19: Message Passing On Tightly-Interconnected Multi-Core Processors

End-to-End Latency Overhead vs. Hand-Coded (1)

• Experiment measures latency for:
 – sender: load message from memory
 – sender: break up and send message
 – receiver: receive message
 – receiver: store message to memory

Page 20: Message Passing On Tightly-Interconnected Multi-Core Processors

End-to-End Latency Overhead vs. Hand-Coded (2)

[Chart: end-to-end latency overhead of rMPI vs. hand-coded GDN: 481% for a 1-word message, 33% for a 1000-word message. Annotation: packet management complexity overflows the cache.]

Page 21: Message Passing On Tightly-Interconnected Multi-Core Processors

Performance Scaling: Jacobi

[Charts: Jacobi performance scaling for a 16x16 input matrix and a 2048x2048 input matrix.]
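The Jacobi benchmark source is not shown in the slides; the sketch below illustrates how such a kernel is commonly parallelized with MPI, assuming a 1-D row decomposition with halo exchange (the decomposition actually used in the talk may differ):

```c
/* Hypothetical MPI Jacobi step with 1-D row decomposition and halo
 * exchange; illustrative only, not the benchmark code from the talk. */
#include <mpi.h>

#define N 2048   /* global matrix dimension for one of the test cases */

void jacobi_step(double (*u)[N + 2], double (*unew)[N + 2],
                 int local_rows, int rank, int size) {
    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Exchange boundary rows with neighbors (halo exchange). */
    MPI_Sendrecv(u[1],              N + 2, MPI_DOUBLE, up,   0,
                 u[local_rows + 1], N + 2, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(u[local_rows],     N + 2, MPI_DOUBLE, down, 1,
                 u[0],              N + 2, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* 4-point stencil update on this rank's interior rows. */
    for (int i = 1; i <= local_rows; i++)
        for (int j = 1; j <= N; j++)
            unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                 u[i][j - 1] + u[i][j + 1]);
}
```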

Page 22: Message Passing On Tightly-Interconnected Multi-Core Processors

Performance Scaling: Jacobi, 16 processors

[Charts: Jacobi on 16 processors. Annotation: the sequential version suffers cache capacity overflow.]

Page 23: Message Passing On Tightly-Interconnected Multi-Core Processors

Overhead: Jacobi, rMPI vs. Hand-Coded

[Chart: rMPI overhead relative to hand-coded Jacobi; 5% overhead on 16 tiles. Annotations: many small messages; memory access synchronization.]

Page 24: Message Passing On Tightly-Interconnected Multi-Core Processors

Matrix Multiplication: rMPI vs. LAM/MPI

[Chart annotation: many smaller messages; smaller message length has less effect on LAM/MPI.]

Page 25: Message Passing On Tightly-Interconnected Multi-Core Processors

Trapezoidal Integration: rMPI vs. LAM/MPI

Page 26: Message Passing On Tightly-Interconnected Multi-Core Processors

Pi Estimation: rMPI vs. LAM/MPI
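The pi benchmark source is likewise not shown; a typical MPI pi-estimation kernel (midpoint integration of 4/(1+x^2) followed by a reduction, in the style of the classic MPICH cpi example) looks roughly like this and is offered purely as an illustration:

```c
/* Hypothetical pi-estimation kernel: each rank integrates a strided
 * subset of intervals, then MPI_Reduce combines the partial sums. */
#include <mpi.h>

double estimate_pi(int n, int rank, int size) {
    double h = 1.0 / n, local = 0.0, pi = 0.0;
    for (int i = rank; i < n; i += size) {
        double x = h * (i + 0.5);        /* midpoint of interval i */
        local += 4.0 / (1.0 + x * x);
    }
    local *= h;
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    return pi;                           /* meaningful on rank 0 only */
}
```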

Page 27: Message Passing On Tightly-Interconnected Multi-Core Processors

Related Work

• Low-latency communication networks
 – iWarp, Alewife, INMOS

• Multi-core processors
 – VIRAM, Wavescalar, TRIPS, POWER4, Pentium D

• Alternatives to programming Raw
 – scalar operand network, CFlow, rawcc

• MPI implementations
 – OpenMPI, LAM/MPI, MPICH

Page 28: Message Passing On Tightly-Interconnected Multi-Core Processors

Summary

• rMPI provides easy yet powerful programming model for multi-cores

• Scales better than commercial MPI implementation

• Low overhead relative to hand-coded applications

Page 29: Message Passing On Tightly-Interconnected Multi-Core Processors

Thanks!

For more information, see Master’s Thesis:

http://cag.lcs.mit.edu/~jim/publications/ms.pdf

Page 30: Message Passing On Tightly-Interconnected Multi-Core Processors
Page 31: Message Passing On Tightly-Interconnected Multi-Core Processors

rMPI messages broken into packets

• Receiver buffers and demultiplexes packets from different sources

• Messages received upon interrupt, and buffered until user-level receive

• GDN messages have a max length of 31 words

• rMPI packet format shown for a 65-word [payload] MPI message

[Diagram: two rMPI sender processes send numbered packets to a single rMPI receiver process; packets from the two senders interleave on the network, and the receiver is interrupted to demultiplex them.]
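As a sketch of what send-side packetization under the 31-word GDN limit could look like: the header size, its fields, and the gdn_send() primitive below are assumptions made for illustration only; the actual rMPI packet format is documented in the thesis.

```c
/* Hypothetical send-side packetization: split an MPI message into GDN
 * messages of at most 31 words.  Header layout and gdn_send() are
 * assumed for illustration; they are not rMPI's real definitions. */
#define GDN_MAX_WORDS   31
#define HDR_WORDS        3                  /* assumed: src rank, tag, sequence */
#define PAYLOAD_PER_PKT (GDN_MAX_WORDS - HDR_WORDS)

extern void gdn_send(int dest_tile, const unsigned *words, int nwords); /* assumed */

static void rmpi_send_packets(int dest, int tag, int my_rank,
                              const unsigned *payload, int payload_words) {
    int seq = 0;
    for (int off = 0; off < payload_words; off += PAYLOAD_PER_PKT, seq++) {
        int chunk = payload_words - off;
        if (chunk > PAYLOAD_PER_PKT)
            chunk = PAYLOAD_PER_PKT;

        unsigned pkt[GDN_MAX_WORDS];
        pkt[0] = (unsigned)my_rank;          /* assumed header fields */
        pkt[1] = (unsigned)tag;
        pkt[2] = (unsigned)seq;
        for (int i = 0; i < chunk; i++)
            pkt[HDR_WORDS + i] = payload[off + i];

        gdn_send(dest, pkt, HDR_WORDS + chunk);  /* one atomic GDN message */
    }
}
```

Under these assumed sizes (3 header words, 28 payload words per packet), a 65-word payload would travel as three GDN packets.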

Page 32: Message Passing On Tightly-Interconnected Multi-Core Processors

rMPI: enabling MPI programs on Raw

rMPI…
• is compatible with current MPI software
• gives programmers already familiar with MPI an easy interface to program Raw
• gives programmers fine-grain control over their programs when automatic parallelization tools are not adequate
• gives users a robust, deadlock-free, and high-performance programming model with which to program Raw

► easily write programs on Raw without overly sacrificing performance

Page 33: Message Passing On Tightly-Interconnected Multi-Core Processors

Packet boundary bookkeeping

• Receiver must handle packet interleaving across multiple interrupt handler invocations

Page 34: Message Passing On Tightly-Interconnected Multi-Core Processors

Receive-side packet management

• Global data structures accessed by interrupt handler and MPI Receive threads

• Data structure design minimizes pointer chasing for fast lookups

• No memcpy for receive-before-send case
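The data structures themselves are not shown in the slides; the sketch below illustrates the kind of receive-side bookkeeping described, shared between the interrupt handler and MPI_Recv, with a flag marking the receive-before-send case so arriving payload can be written straight into the user buffer. All names and the fixed buffer cap are hypothetical.

```c
/* Hypothetical receive-side bookkeeping shared by the GDN interrupt
 * handler and MPI_Recv.  Names, fields, and sizes are illustrative. */
#include <stdlib.h>

#define MAX_MSG_WORDS 4096                 /* assumed cap for the sketch */

typedef struct pending_msg {
    int src, tag;
    unsigned *data;                        /* user buffer if the receive was
                                              posted first, else an internal one */
    int words_filled;
    int posted_by_receiver;                /* nonzero: no later memcpy needed */
    struct pending_msg *next;
} pending_msg_t;

static pending_msg_t *pending_list;        /* searched by (src, tag) */

static pending_msg_t *find_or_create(int src, int tag) {
    for (pending_msg_t *m = pending_list; m; m = m->next)
        if (m->src == src && m->tag == tag)
            return m;
    pending_msg_t *m = calloc(1, sizeof *m);
    m->src = src;
    m->tag = tag;
    m->data = malloc(MAX_MSG_WORDS * sizeof(unsigned));  /* internal buffer */
    m->next = pending_list;
    pending_list = m;
    return m;
}

/* Called from the interrupt handler for each arriving packet; MPI_Recv
 * either claims a buffered entry or pre-registers one pointing at the
 * user buffer with posted_by_receiver set. */
static void deliver_packet(int src, int tag, const unsigned *payload, int n) {
    pending_msg_t *m = find_or_create(src, tag);
    for (int i = 0; i < n; i++)
        m->data[m->words_filled++] = payload[i];
}
```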

Page 35: Message Passing On Tightly-Interconnected Multi-Core Processors

User-thread CFG for receiving

Page 36: Message Passing On Tightly-Interconnected Multi-Core Processors

Interrupt handler CFG

• logic supports MPI semantics and packet construction

Page 37: Message Passing On Tightly-Interconnected Multi-Core Processors

Future work: improving performance

• Comparison of rMPI to a standard cluster running an off-the-shelf MPI library

• Improve system performance
 – further minimize MPI overhead
 – spatially-aware collective communication algorithms
 – further Raw-specific optimizations

• Investigate new APIs better suited for tiled processor architectures (TPAs)

Page 38: Message Passing On Tightly-Interconnected Multi-Core Processors

Future work: HW extensions

• Simple hardware tweaks may significantly improve performance
 – larger input/output FIFOs
 – simple switch logic/demultiplexing to handle packetization could drastically simplify software logic
 – larger header words (64 bit?) would allow for much larger (atomic) packets
   • (also, the current header only scales to 32 x 32 tile fabrics)

Page 39: Message Passing On Tightly-Interconnected Multi-Core Processors

Conclusions

• The MPI standard was designed for “standard” parallel machines, not for tiled architectures
 – MPI may no longer make sense for tiled designs

• Simple hardware could significantly reduce packet management overhead and increase rMPI performance