Message Passing On Tightly-Interconnected Multi-Core Processors
James Psota and Anant Agarwal, MIT CSAIL
Technology Scaling Enables Multi-Cores
• Multi-cores offer a novel environment for parallel computing
[Images: a cluster; a multi-core processor]
Traditional Communication On Multi-Processors
• Shared memory
– Shared caches or memory
– Remote DMA (RDMA)
• Interconnects
– Ethernet TCP/IP
– Myrinet
– Scalable Coherent Interconnect (SCI)
[Images: AMD Dual-Core Opteron; Beowulf Cluster]
On-Chip Networks Enable Fast Communication
• Some multi-cores offer…
– tightly integrated on-chip networks
– direct access to hardware resources (no OS layers)
– fast interrupts
• MIT Raw Processor used for experimentation and validation
Parallel Programming is Hard
• Must orchestrate computation and communication
• Extra resources present both opportunity and challenge
• Trivial to deadlock (see the sketch below)
• Constraints on message sizes
• No operating system support
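To make the deadlock pitfall concrete, here is a hedged illustration (not from the talk) using standard MPI calls: if both ranks send first on a network with minimal buffering, like Raw's GDN, neither blocking send can complete.

```c
/* Hypothetical illustration: head-to-head blocking sends can deadlock
 * on a minimally buffered network. Not code from the talk. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, peer, buf[1024], tmp[1024];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;  /* assumes exactly two ranks */

    /* Both ranks send first: if the implementation cannot buffer the
     * message, each MPI_Send blocks until the peer posts a receive,
     * which never happens. */
    MPI_Send(buf, 1024, MPI_INT, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(tmp, 1024, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Fix: order the calls (one rank receives first) or use MPI_Sendrecv. */
    MPI_Finalize();
    return 0;
}
```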
rMPI’s Approach
• Goals
– robust, deadlock-free, scalable programming interface
– easy to program through high-level routines
• Challenge
– exploit hardware resources for efficient communication
– don't sacrifice performance
Outline
• Introduction
• Background
• Design
• Results
• Related Work
The Raw Multi-Core Processor
• 16 identical tiles
– processing core
– network routers
• 4 register-mapped on-chip networks
• Direct access to hardware resources
• Hardware fabricated in ASIC process
[Image: the Raw processor]
Raw’s General Dynamic Network
• Handles run-time events
– interrupts, dynamic messages
• Network guarantees atomic, in-order messages
• Dimension-ordered wormhole routed
• Maximum message length: 31 words
• Blocking sends/receives
• Minimal network buffering
MPI: Portable Message Passing API
• Gives programmers high-level abstractions for parallel programming
– send/receive, scatter/gather, reductions, etc.
• MPI is a standard, not an implementation
– many implementations for many HW platforms
– over 200 API functions
• MPI applications are portable across MPI-compliant systems (see the minimal example below)
• Can impose high overhead
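For orientation, here is a minimal MPI program using only standard calls; the same source compiles against any compliant implementation. This example is illustrative and not taken from the talk.

```c
/* Minimal sketch of the portable MPI API (standard MPI calls,
 * not specific to rMPI). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */
    printf("process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```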
MPI Semantics: Cooperative Communication
• Data exchanged cooperatively via explicit send and receive
• Receiving process's memory is modified only with its explicit participation
• Combines communication and synchronization (see the sketch below)
[Diagram: processes 0 and 1, each with a private address space, exchange messages over a communication channel via send(dest, tag) and recv(src, tag) with tags 17 and 42; arriving messages trigger interrupts on the receiver]
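A hedged sketch of the send/receive pairing the diagram depicts, using standard MPI point-to-point calls (illustrative, not code from the talk):

```c
/* Cooperative communication between two processes: the receiver's
 * memory is written only because it explicitly asked to receive. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, data = 42, temp;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Sender names the destination and tags the message. */
        MPI_Send(&data, 1, MPI_INT, /*dest=*/1, /*tag=*/17, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive matches on (source, tag); communication and
         * synchronization happen together. */
        MPI_Recv(&temp, 1, MPI_INT, /*src=*/0, /*tag=*/17, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```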
Outline
• Introduction
• Background
• Design
• Results
• Related Work
rMPI System Architecture
High-Level MPI Layer
• Argument checking (MPI semantics)
• Buffer preparation
• Calls appropriate low-level functions
• LAM/MPI partially ported
Collective Communications Layer
• Algorithms for collective operations
– Broadcast
– Scatter/Gather
– Reduce
• Invokes low-level functions (a broadcast sketch follows below)
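As a hedged sketch of how a collective layer can sit on top of point-to-point operations, a linear broadcast is shown below; rMPI's actual algorithms are not given in the transcript and are likely more sophisticated (e.g., tree-based), so this only illustrates the layering.

```c
/* Hedged sketch: a collective (broadcast) built from point-to-point
 * sends. Not rMPI's actual algorithm. */
#include <mpi.h>

void linear_bcast(int *buf, int count, int root, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == root) {
        for (int p = 0; p < size; p++)   /* root sends to every peer */
            if (p != root)
                MPI_Send(buf, count, MPI_INT, p, 0, comm);
    } else {
        MPI_Recv(buf, count, MPI_INT, root, 0, comm, MPI_STATUS_IGNORE);
    }
}
```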
Point-to-Point Layer
• Low-level send/receive routines
• Highly optimized interrupt-driven receive design (see the sketch below)
• Packetization and reassembly
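Below is a minimal sketch of an interrupt-driven receive path, assuming a hypothetical handler signature and buffer layout; it models the idea (drain the network on interrupt, buffer until the user-level receive) rather than rMPI's actual code.

```c
/* Hedged model of interrupt-driven receive, not rMPI source. */
#include <string.h>

#define MAX_WORDS 4096

static int rx_buffer[MAX_WORDS];   /* packets buffered until user recv */
static int rx_count = 0;

/* Invoked when GDN data arrives (hypothetical signature). Draining
 * the network promptly keeps senders from blocking. */
void gdn_interrupt_handler(const int *packet, int words) {
    memcpy(&rx_buffer[rx_count], packet, words * sizeof(int));
    rx_count += words;
}

/* User-level receive: if the message already arrived, hand the
 * buffered words to the caller without touching the network. */
int user_recv(int *dest, int words) {
    if (rx_count < words)
        return -1;   /* not yet arrived; real code would block or spin */
    memcpy(dest, rx_buffer, words * sizeof(int));
    rx_count = 0;
    return words;
}
```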
Outline
• Introduction
• Background
• Design
• Results
• Related Work
rMPI Evaluation
• How much overhead does the high-level interface impose?
– compare against hand-coded GDN
• Does it scale with problem size and number of processors?
– compare against hand-coded GDN
– compare against a commercial MPI implementation on a cluster
End-to-End Latency Overhead vs. Hand-Coded (1)
• Experiment measures latency for:
– sender: load message from memory
– sender: break up and send message
– receiver: receive message
– receiver: store message to memory
End-to-End Latency Overhead vs. Hand-Coded (2)
[Graph: rMPI latency overhead vs. hand-coded GDN: 481% for 1-word messages, falling to 33% at 1000 words; for large messages, packet management complexity overflows the cache]
Performance Scaling: Jacobi
[Graphs: Jacobi scaling for a 16x16 input matrix and a 2048x2048 input matrix]
Performance Scaling: Jacobi, 16 processors
[Graph annotation: the sequential version suffers cache capacity overflow at large problem sizes]
Overhead: Jacobi, rMPI vs. Hand-Coded
[Graph annotations: many small messages; 16 tiles: 5% overhead; memory access synchronization]
Matrix Multiplication: rMPI vs. LAM/MPI
[Graph annotation: many smaller messages; smaller message length has less effect on LAM]
Trapezoidal Integration: rMPI vs. LAM/MPI
Pi Estimation: rMPI vs. LAM/MPI
Related Work
• Low-latency communication networks
– iWarp, Alewife, INMOS
• Multi-core processors
– VIRAM, Wavescalar, TRIPS, POWER4, Pentium D
• Alternative approaches to programming Raw
– scalar operand network, CFlow, rawcc
• MPI implementations
– OpenMPI, LAM/MPI, MPICH
Summary
• rMPI provides an easy yet powerful programming model for multi-cores
• Scales better than a commercial MPI implementation
• Low overhead relative to hand-coded applications
Thanks!
For more information, see Master’s Thesis:
http://cag.lcs.mit.edu/~jim/publications/ms.pdf
rMPI messages broken into packets
• Receiver buffers and demultiplexes packets from different sources
• Messages received upon interrupt, and buffered until user-level receive
[Diagram: two rMPI sender processes interleave numbered packets toward an rMPI receiver process, which demultiplexes them on interrupt]
• GDN messages have a max length of 31 words
• rMPI packet format for a 65-payload-word MPI message (see the worked sketch below)
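A hedged sketch of the packetization arithmetic: the GDN's 31-word limit must cover both header and payload, so the per-packet header size (assumed to be 2 words here; the real rMPI layout is not given in this transcript) determines how many packets a 65-word message needs.

```c
/* Hedged sketch of sender-side packetization under the GDN limit.
 * HEADER_WORDS is an assumption for illustration only. */
#include <stdio.h>

#define GDN_MAX_WORDS 31
#define HEADER_WORDS  2   /* assumed: e.g., source/tag + sequencing info */
#define PAYLOAD_PER_PACKET (GDN_MAX_WORDS - HEADER_WORDS)

int packets_needed(int payload_words) {
    /* ceiling division: the last packet may be partially full */
    return (payload_words + PAYLOAD_PER_PACKET - 1) / PAYLOAD_PER_PACKET;
}

int main(void) {
    /* Under these assumptions, a 65-word MPI message needs
     * ceil(65 / 29) = 3 GDN packets. */
    printf("%d packets\n", packets_needed(65));
    return 0;
}
```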
rMPI: enabling MPI programs on Raw
rMPI…
• is compatible with current MPI software
• gives programmers already familiar with MPI an easy interface to program Raw
• gives programmers fine-grain control over their programs when automatic parallelization tools are not adequate
• gives users a robust, deadlock-free, and high-performance programming model with which to program Raw
► easily write programs on Raw without overly sacrificing performance
Packet boundary bookkeeping
• Receiver must handle packet interleaving across multiple interrupt handler invocations
Receive-side packet management
• Global data structures accessed by both the interrupt handler and MPI receive threads
• Data structure design minimizes pointer chasing for fast lookups (a hedged sketch follows below)
• No memcpy needed for the receive-before-send case
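A minimal sketch of the idea, assuming a direct-indexed table keyed by source rank; the real rMPI structures are not shown in the transcript.

```c
/* Hedged sketch (not rMPI's actual structures): index buffered
 * messages by source rank so the interrupt handler and the receive
 * path find a message's state in O(1), minimizing pointer chasing. */
#define MAX_RANKS 16          /* Raw has 16 tiles */

typedef struct {
    int  tag;                 /* MPI tag of the in-flight message */
    int  words_received;      /* reassembly progress across packets */
    int *payload;             /* buffered words awaiting user-level recv */
} inflight_msg;

/* One slot per sender: a packet's source rank indexes directly into
 * the table, so the handler never walks a list per packet. */
static inflight_msg inflight[MAX_RANKS];

inflight_msg *lookup(int src_rank) {
    return &inflight[src_rank];   /* constant time, no pointer chasing */
}
```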
[Diagrams: control-flow graphs (CFGs) for the user-thread receive path and the interrupt handler]
• Logic supports MPI semantics and packet construction
Future work: improving performance
• Compare rMPI against a standard cluster running an off-the-shelf MPI library
• Improve system performance
– further minimize MPI overhead
– spatially-aware collective communication algorithms
– further Raw-specific optimizations
• Investigate new APIs better suited for tiled processor architectures (TPAs)
Future work: HW extensions
• Simple hardware tweaks may significantly improve performance
– larger input/output FIFOs
– simple switch logic/demultiplexing to handle packetization could drastically simplify software logic
– larger header words (64-bit?) would allow much larger (atomic) packets (the current header only scales to 32 x 32 tile fabrics)
Conclusions
• The MPI standard was designed for "standard" parallel machines, not for tiled architectures
– MPI may no longer make sense for tiled designs
• Simple hardware could significantly reduce packet management overhead and increase rMPI performance