Basic message passing benchmarks, methodology and pitfalls
Transcript of the presentation slides.

Rolf Hempel
C & C Research Laboratories, NEC Europe Ltd.
Rathausallee 10, 53757 Sankt Augustin, Germany
email: [email protected]

Further info.: http://www.ccrl-nece.technopark.gmd.de

SPEC Workshop, Wuppertal, Sept. 13, 1999
NEC R&D Group Overseas Facilities:
- C&C Research Laboratories, NEC Europe Ltd. (Bonn, Germany)
- NEC Research Institute, Inc. (Princeton, U.S.A.)
- C&C Research Laboratories, NEC U.S.A., Inc. (Princeton, San Jose, U.S.A.)

(Diagram: the overseas facilities connected to NEC in Japan)
MPI Implementations at NEC CCRLE:
- Cenju-4: MPI for Cenju, derived from MPI/SX
- LAMP: MPI for PC cluster
- Earth Simulator: MPI design for the Earth Simulator
- SX-5 / SX-5M: MPI/SX product development (currently for SX-4/5)

(Image courtesy of National Space Development Agency of Japan / Japan Atomic Energy Research Institute)
Most important MPI implementation: MPI/SX

SX-5 (in the past SX-4): commercial product of NEC, a parallel vector supercomputer.

Since 1997, MPI/SX product development at CCRLE:
- standard-compliant, fully tested interface
- optimized MPI-1; MPI-2 almost finished
- maintenance & customer support
MPI design for the Earth Simulator

(Image courtesy of National Space Development Agency of Japan / Japan Atomic Energy Research Institute)

- Massively parallel computer: thousands of processors
- MPI is the basis for all large-scale applications
- Design and implementation at CCRLE
- At the moment: design in collaboration with the OS group at NEC Fuchu
The purpose of message-passing benchmarks

Goal: measure the time needed for elementary operations (such as send/receive), so that the behavior of application programs can be modeled.

Problem: it is difficult to measure the time for a single operation:
- no global clock
- clock resolution too low
- difficult to differentiate: receive time vs. synchronization delay
Standard solution: measure loops over communication operations.

Simple example: the ping-pong benchmark:

    Process 0       Process 1
    Send to 1       Recv from 0
    Recv from 1     Send to 0

Iterate 1000 times and measure the time T for the entire loop.

Result: time for a single message = T / 2000
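The loop-and-divide arithmetic can be sketched in a few lines. Here Python threads and queues stand in for the two MPI processes; this is an illustrative stand-in, not a real MPI benchmark, which would use MPI_Send/MPI_Recv and MPI_Wtime:

```python
import threading
import queue
import time

def ping_pong_benchmark(iterations=1000):
    """Time a loop of round trips, then divide by 2 * iterations
    to estimate the cost of one message (ping-pong methodology)."""
    to_p1, to_p0 = queue.Queue(), queue.Queue()
    msg = b"x" * 8  # small fixed payload

    def process1():
        for _ in range(iterations):
            data = to_p1.get()   # Recv from 0
            to_p0.put(data)      # Send to 0

    t = threading.Thread(target=process1)
    t.start()
    start = time.perf_counter()
    for _ in range(iterations):  # process 0's loop
        to_p1.put(msg)           # Send to 1
        to_p0.get()              # Recv from 1
    total = time.perf_counter() - start
    t.join()
    return total / (2 * iterations)  # time for a single message

print(f"estimated per-message time: {ping_pong_benchmark():.2e} s")
```

Note that the division by 2 * iterations is exactly the implicit-averaging step the following slides criticize.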
Implicit assumption: the time for a message in an application code will be similar to the benchmark result.

Why is this usually not the case?

1. The receiver in ping-pong is always ready to receive:
   - receive in 'solicited message' mode
   - delays and intermediate copies can be avoided

2. Only two processes are active:
   - no contention on the interconnect system
   - non-scaling effects are not visible (e.g. locks on global data structures)
3. Untypical response to different progress concepts:

Single-threaded MPI implementation: the application process calls MPI_Recv, and the MPI library checks for and services outstanding requests only inside that call.

Multi-threaded MPI implementation: the application thread calls MPI_Recv, while a separate communication thread in the MPI process constantly checks for and services communication requests.
Single-threaded MPI: progress at the receiver only when an MPI routine is called. Bad sender/receiver synchronization ⇒ bad progress.

Multi-threaded MPI: progress at the receiver is independent of the application thread. Communication progress is independent of sender/receiver synchronization.

This advantage is not seen in the ping-pong benchmark! The single-threaded implementation always looks better there.

(Graph: comparison of ping-pong latency on a Myrinet PC cluster)
4. Data may be cached between loop iterations:

(Diagram: the MPI process active in ping-pong keeps the MPI_Recv / MPI_Send buffers in cache)

Result: much higher bandwidth than in real applications.

(Graph: comparison of ping-pong bandwidth on a Myrinet PC cluster, cached versus non-cached data)
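A common countermeasure, sketched here under assumed sizes rather than prescribed by the talk, is to rotate the benchmark through a pool of buffers whose total size exceeds the cache, so each iteration touches cold data the way an application would:

```python
# Assumed sizes for illustration: pool larger than the last-level cache.
POOL_BYTES = 32 * 1024 * 1024   # total pool size (assumption)
MSG_BYTES = 10 * 1024           # message size under test

n_buffers = POOL_BYTES // MSG_BYTES
pool = [bytearray(MSG_BYTES) for _ in range(n_buffers)]

def buffer_for_iteration(i):
    """Pick a different send buffer each iteration (round-robin),
    so consecutive iterations never reuse still-cached data."""
    return pool[i % n_buffers]

# Consecutive iterations use distinct buffers; a buffer only
# repeats after the whole pool (larger than cache) has been cycled.
assert buffer_for_iteration(0) is not buffer_for_iteration(1)
assert buffer_for_iteration(0) is buffer_for_iteration(n_buffers)
```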
This can be even worse on CC-NUMA architectures (e.g. SGI Origin 2000):

Operation to be timed: call shmem_get8 (b(1), a(1), n, myrank-1)

(Diagram: processor myrank copies the array a(1..n) from processor myrank-1 into b; the remote data crosses the network into the local cache on the first transfer only, while every transfer reads from that cache)

Consequence: all but the first transfer are VERY fast!
The problem of parametrization

Goal: instead of displaying a performance graph, define a few parameters that characterize the communication behavior.

Most popular parameters:
- Latency: time for a 0-byte message
- Bandwidth: asymptotic throughput for long messages
Problem: this model assumes a very simple communication protocol; it is not consistent with most MPI implementations.

Example: MPI on the NEC Cenju-4 (MPP system, R10000 processors)

(Graph with two straight-line fittings: blue fitting: latency = 8.5 µsec; red fitting: latency = 24 µsec)

Something seems to change at message size = 1024 bytes.
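The effect is easy to reproduce with synthetic numbers (the coefficients below are made up for illustration, not Cenju-4 measurements): fitting the straight line t = latency + n / bandwidth on either side of a protocol switch yields two very different "latencies":

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def model_time(nbytes):
    """Synthetic timing model with a protocol switch at 1024 bytes."""
    if nbytes < 1024:
        return 8.5 + nbytes * 0.01    # µsec: low-startup short protocol
    return 24.0 + nbytes * 0.005      # µsec: higher-startup long protocol

sizes_short = [0, 128, 256, 512, 1000]
sizes_long = [2048, 4096, 8192, 16384]
lat_short, _ = linear_fit(sizes_short, [model_time(n) for n in sizes_short])
lat_long, _ = linear_fit(sizes_long, [model_time(n) for n in sizes_long])
print(round(lat_short, 1), round(lat_long, 1))  # two different "latencies"
```

The fitted intercept depends entirely on which message sizes the benchmark happens to sample, which is the pitfall the slide illustrates.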
It can be even worse: discontinuities in the timing curve.

Example: NEC SX-4, send/receive within a 32-processor node.
In most modern MPI implementations there are three different protocols, depending on the message length:

Short messages: the sender writes the message envelope + data into pre-defined slots; the receiver gets the data when it is ready.

''Eager'' messages: the sender writes the message envelope into a pre-defined slot and a local copy of the message data into an intermediate buffer; the receiver copies the data into the receive buffer in user memory when it is ready.

''Long'' messages: the sender writes only the message envelope into a pre-defined slot; sender and receiver synchronize, and the data is then transferred directly from the send buffer to the receive buffer.
Good reasons for the protocol changes:

Short messages: copying does not cost much; it is important not to block the sender for too long; pre-allocated slots for the intermediate copy avoid the time to allocate memory.

''Eager'' messages: copying does not cost much; it is important not to block the sender for too long; the intermediate buffer is allocated on the fly.

Long messages: copying is expensive, so intermediate copies are avoided; rendezvous between sender and receiver; synchronous data transfer ⇒ optimal throughput.

In the SX-5 implementation, the protocol changes at 1,024 and 100,000 bytes.
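The length-based dispatch amounts to a simple threshold function. The thresholds below are the SX-5 figures quoted above; the exact boundary behavior (strictly less-than versus less-than-or-equal) is an assumption:

```python
SHORT_LIMIT = 1024      # bytes: short-message protocol below this
EAGER_LIMIT = 100_000   # bytes: eager protocol up to this, long beyond

def choose_protocol(nbytes: int) -> str:
    """Illustrative protocol selection by message length."""
    if nbytes < SHORT_LIMIT:
        return "short"   # envelope + data into pre-allocated slots
    if nbytes < EAGER_LIMIT:
        return "eager"   # intermediate buffer allocated on the fly
    return "long"        # rendezvous, no intermediate copy

assert choose_protocol(8) == "short"
assert choose_protocol(50_000) == "eager"
assert choose_protocol(1_000_000) == "long"
```

A benchmark that samples message sizes on only one side of a threshold will characterize only one of these regimes.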
Protocol changes are dangerous for curve fittings!

Famous example: the COMMS1 - COMMS3 benchmarks.

Procurement at NEC: a customer requested a certain message latency, to be measured with COMMS1.

Problem: simplistic curve fitting in COMMS1 ⇒ the SX-4 latency came out too high.

Solution: we made our MPI SLOWER for long messages ⇒ the slope of the fitting line changed ⇒ much better latency ⇒ a happy customer!
The problem of averaging

Problem: it is difficult to measure the time for a single MPI operation:
- no global clock
- clock resolution too low
- difficult to differentiate: receive time vs. synchronization delay

Solution: averaging over many iterations.

Implicit assumption: all iterations take the same time, so averaging increases the accuracy.

This is not always true!
Example: the remote get operation shmem_get8 on the Cray T3E

Experiment with a message size of 10 Kbytes:
- 100 measurement runs (tests)
- in each test, ITER calls to shmem_get8 in a loop
- for each test, the time is divided by ITER ⇒ time for a single operation

The graph on the next slide shows the results for the 100 tests.

(The reported results are courtesy of Prof. Glenn Luecke, Iowa State University, Ames, Iowa, U.S.A.)
There are 7 huge spikes in the graph. Probable reason: operating system interruptions. Should those tests be included in the average time?
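The dilemma is easy to illustrate with synthetic numbers (made up for illustration, not the T3E data): a few spikes pull the mean well away from the typical value, while the median is barely affected. This is one reason benchmark reports often quote the median or minimum alongside the average:

```python
import statistics

normal = [50.0] * 93   # per-operation time (µsec) in 93 quiet tests
spikes = [500.0] * 7   # 7 tests hit by operating system interruptions
tests = normal + spikes

mean_t = statistics.mean(tests)      # 81.5 µsec: pulled up by the spikes
median_t = statistics.median(tests)  # 50.0 µsec: unaffected by them
print(mean_t, median_t)
```

Whether the spiked tests "should" be included depends on what the benchmark claims to measure: typical behavior (median) or expected behavior including OS noise (mean).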
A more subtle example: MPI_Reduce

Let's compare the behavior of two MPI implementations.

Test setup:
- root process plus six child processes
- reduction on an 8-byte real
- the time is measured for 1000 iterations

```fortran
      include 'mpif.h'
      real*8  a, b, t1, t2
      integer i, ierror

      t1 = MPI_WTIME()
      do i = 1, 1000
         call MPI_REDUCE(a, b, 1, MPI_DOUBLE_PRECISION,
     +                   MPI_MAX, 0, MPI_COMM_WORLD, ierror)
      end do
      t2 = MPI_WTIME()
```
First implementation (a relatively good algorithm):
- asynchronous algorithm
- senders are not blocked, independent of the receiver status
- low synchronization delay

Second implementation (a relatively bad algorithm):
- synchronous algorithm: step 1, synchronize all processes; step 2, perform the reduction as in the first algorithm
- senders are blocked until the last process is ready
- high synchronization delay

(Diagrams: a tree of sender processes feeding the root = receiver)
What happens at runtime?

(Diagram: a tree of sender processes feeding the root = receiver)

For the first MPI implementation:
- leaf processes are fastest, and are not blocked in a single operation
- they quickly go through the loop
- messages pile up at the intermediate processors
- the growing message queues slow those processors down
- at some point flow control takes over, but progress is slow

Important to remember: we want to measure the time for ONE reduction operation. We only use loop averaging to increase the accuracy.

This does not happen with the second (bad) MPI implementation! The unnecessary synchronization keeps messages from piling up.
Compound benchmarks

Components:
- MPI_Send / MPI_Recv
- MPI_Bsend / MPI_Recv
- MPI_Isend / MPI_Recv
- MPI_Reduce
- MPI_Bcast

Benchmark driver:
- user interface
- calls the individual components
- reports timings / statistics
- computes effective benchmark parameters

The benchmark suite is executed in a single run: elegant and flexible to use.

Can there be any problems? ... Guess what!
In a parallel program it is a good idea not to print results in every process (or the ordering is up to chance).

Alternative: send the results as messages to process 0 and do all the printing there.

What if the active process sets change between benchmark components?

(Diagram: the components MPI_Send / MPI_Recv, MPI_Bsend / MPI_Recv, MPI_Isend / MPI_Recv, MPI_Reduce and MPI_Bcast all report to process 0)
We have seen a public benchmark suite in which, in every phase, every process sends a message to process 0: either the timing results for this phase or the message "I am not active in this phase". At the same time, process 0 is active in the ping-pong with process 1.

Result: process 0 gets bombarded with "unsolicited messages" during the ping-pong.

Good MPI implementation: handles incoming communication requests as early as possible ⇒ good progress.

Bad MPI implementation: handles "unsolicited messages" only if nothing else is to be done ⇒ bad progress.

In our case, the bad MPI implementation wins the ping-pong benchmark!
Summary: important to remember when writing an MPI benchmark:

Point-to-point performance is very sensitive to:
- the relative timing of send / receive
- the protocol (dependent on message length)
- contention / locks
- cache effects
⇒ Ping-pong results have very limited value.

There is a trade-off between progress and message latency (invisible in the ping-pong benchmark, but can be important in real applications).

Parameter fittings usually lead to "bogus" results (they do not take protocol changes into account).
Summary (continued): important to remember when writing an MPI benchmark:

Loop averaging is highly problematic:
- variance in single-operation performance is invisible
- messages from different iterations may lead to congestion

Keep single benchmarks separate:
- different stages in a suite may interfere with each other (the "unsolicited message" problem)

It should not happen that making MPI slower ⇒ better benchmark results!