Basic message passing benchmarks, methodology and pitfalls

Transcript of Basic message passing benchmarks, methodology and pitfalls

Page 1: Basic message passing benchmarks, methodology and pitfalls

Basic message passing benchmarks, methodology and pitfalls

Further info.: http://www.ccrl-nece.technopark.gmd.de

Rolf Hempel
C & C Research Laboratories, NEC Europe Ltd.
Rathausallee 10
53757 Sankt Augustin
Germany
email: hempel@ccrl-nece.technopark.gmd.de

SPEC Workshop, Wuppertal, Sept. 13, 1999

Page 2: Basic message passing benchmarks, methodology and pitfalls

NEC R&D Group Overseas Facilities (slide shows a world map):

- C&C Research Laboratories, NEC Europe Ltd. (Bonn, Germany)
- NEC Research Institute, Inc. (Princeton, U.S.A.)
- C&C Research Laboratories, NEC U.S.A., Inc. (Princeton, San Jose, U.S.A.)
- Japan

Page 3: Basic message passing benchmarks, methodology and pitfalls

MPI Implementations at NEC CCRLE:

- Cenju-4: MPI for Cenju, derived from MPI/SX
- LAMP: MPI for PC cluster
- Earth Simulator: MPI design for the Earth Simulator
  (image courtesy of National Space Development Agency of Japan / Japan Atomic Energy Research Institute)
- SX-5, SX-5M: MPI/SX product development (currently for SX-4/5)

Page 4: Basic message passing benchmarks, methodology and pitfalls

Most important MPI implementation: MPI/SX

SX-5 (in the past SX-4): commercial product of NEC, a parallel vector supercomputer.

Since 1997: MPI/SX product development at CCRLE:
- standard-compliant, fully tested interface
- optimized MPI-1; MPI-2 almost finished
- maintenance & customer support

Page 5: Basic message passing benchmarks, methodology and pitfalls

MPI design for the Earth Simulator
(Image courtesy of National Space Development Agency of Japan / Japan Atomic Energy Research Institute.)

- Massively parallel computer: thousands of processors
- MPI is the basis for all large-scale applications
- Design and implementation at CCRLE
- At the moment: design in collaboration with the OS group at NEC Fuchu

Page 6: Basic message passing benchmarks, methodology and pitfalls

The purpose of message-passing benchmarks

Goal: measure the time needed for elementary operations (such as send/receive), so that the behavior of application programs can be modeled.

Problem: it is difficult to measure the time for a single operation:
- no global clock
- clock resolution too low
- difficult to differentiate: receive time vs. synchronization delay

Page 7: Basic message passing benchmarks, methodology and pitfalls

Standard solution: measure loops over communication operations.

Simple example: the ping-pong benchmark:

      Process 0        Process 1
      Send to 1        Recv from 0
      Recv from 1      Send to 0

Iterate 1000 times and measure the time T for the entire loop.

Result: time for a single message = T / 2000
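A minimal Fortran sketch of such a ping-pong loop (assuming MPI-1 and following the style of the MPI_Reduce timing loop shown later; the 1024-element buffer and the iteration count are arbitrary choices made here for illustration):

      program pingpong
      implicit none
      include 'mpif.h'
      integer NITER, N
      parameter (NITER = 1000, N = 1024)
      real*8 buf(N), t1, t2
      integer rank, ierror, i, status(MPI_STATUS_SIZE)

      call MPI_INIT (ierror)
      call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierror)

c     time the whole loop; each iteration sends two messages
      t1 = MPI_WTIME ()
      do i = 1, NITER
         if (rank .eq. 0) then
            call MPI_SEND (buf, N, MPI_DOUBLE_PRECISION, 1, 0,
     +                     MPI_COMM_WORLD, ierror)
            call MPI_RECV (buf, N, MPI_DOUBLE_PRECISION, 1, 0,
     +                     MPI_COMM_WORLD, status, ierror)
         else if (rank .eq. 1) then
            call MPI_RECV (buf, N, MPI_DOUBLE_PRECISION, 0, 0,
     +                     MPI_COMM_WORLD, status, ierror)
            call MPI_SEND (buf, N, MPI_DOUBLE_PRECISION, 0, 0,
     +                     MPI_COMM_WORLD, ierror)
         end if
      end do
      t2 = MPI_WTIME ()

c     2*NITER messages in total => divide by 2000
      if (rank .eq. 0) print *, 'time per message:', (t2-t1)/(2*NITER)
      call MPI_FINALIZE (ierror)
      end

The loop is timed as a whole because the clock resolution is too low for a single message; this is exactly the averaging discussed later.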

Page 8: Basic message passing benchmarks, methodology and pitfalls

Implicit assumption: the time for a message in an application code will be similar to the benchmark result.

Why is this usually not the case?

1. The receiver in ping-pong is always ready to receive
   => receive in "solicited message" mode
   => delays or intermediate copies can be avoided
   (see the sketch after item 2)

2. Only two processes are active
   => no contention on the interconnect system; non-scaling effects are not visible (e.g. locks on global data structures)
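To expose the first effect, one can delay the receiver before it posts its receive, so that the message arrives unsolicited. A hypothetical variant of the rank-1 branch of the ping-pong sketch above (it needs an extra real*8 variable t; the 100-microsecond delay is an arbitrary choice and must be subtracted from the measured time):

c     rank 1: busy-wait before posting the receive, so the incoming
c     message is unsolicited and the library may have to buffer it
         else if (rank .eq. 1) then
            t = MPI_WTIME ()
 10         if (MPI_WTIME () - t .lt. 1.0d-4) goto 10
            call MPI_RECV (buf, N, MPI_DOUBLE_PRECISION, 0, 0,
     +                     MPI_COMM_WORLD, status, ierror)
            call MPI_SEND (buf, N, MPI_DOUBLE_PRECISION, 0, 0,
     +                     MPI_COMM_WORLD, ierror)
         end if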

Page 9: Basic message passing benchmarks, methodology and pitfalls

3. Untypical response to different progress concepts:

Single-threaded MPI implementation (diagram): the application process calls MPI_Recv; only inside such MPI calls does the library check for and service outstanding requests.

Multi-threaded MPI implementation (diagram): the application thread calls MPI_Recv, while a separate communication thread within the MPI process constantly checks for and services communication requests.

Page 10: Basic message passing benchmarks, methodology and pitfalls

Single-threaded MPI: progress at the receiver only when an MPI routine is called
=> bad sender/receiver synchronization => bad progress.

Multi-threaded MPI: progress at the receiver is independent of the application thread
=> communication progress is independent of sender/receiver synchronization.

This advantage is not seen in the ping-pong benchmark! There, the single-threaded implementation always looks better.

(Slide shows a comparison of ping-pong latency on a Myrinet PC cluster.)
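A sketch of a probe that makes background progress visible (a hypothetical test, not from the talk): the receiver posts a non-blocking receive for a long message, computes for a fixed delay, and only then waits. If the elapsed time is close to the delay alone, the library progressed in the background; if it is close to delay plus transfer time, it did not. Variables are as in the ping-pong sketch, plus an integer request and a real*8 delay:

c     rank 1: post the receive, then compute before waiting
      delay = 1.0d-2
      call MPI_IRECV (buf, N, MPI_DOUBLE_PRECISION, 0, 0,
     +                MPI_COMM_WORLD, request, ierror)
      t1 = MPI_WTIME ()
 20   if (MPI_WTIME () - t1 .lt. delay) goto 20
      call MPI_WAIT (request, status, ierror)
      t2 = MPI_WTIME ()
c     t2-t1 close to delay            => background progress
c     t2-t1 close to delay + transfer => progress only inside MPI_WAIT
      print *, 'elapsed:', t2 - t1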

Page 11: Basic message passing benchmarks, methodology and pitfalls

4. Data may be cached between loop iterations:

(Diagram: the MPI process active in the ping-pong calls MPI_Recv and MPI_Send on the same buffer, which therefore stays in the cache.)

=> Much higher bandwidth than in real applications.

(Slide shows a comparison of ping-pong bandwidth on a Myrinet PC cluster, cached versus non-cached data.)
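A common remedy (one possible approach, not taken from the talk) is to rotate through a pool of buffers whose total size exceeds the cache, so that every iteration touches cold data. A fragment modifying the ping-pong sketch above; NBUF should be chosen so that NBUF*N*8 bytes is well above the cache size:

      integer NBUF, k
      parameter (NBUF = 64)
      real*8 pool(N, NBUF)
c     use a different buffer in each iteration; by the time a buffer
c     is reused it has been evicted from the cache
      do i = 1, NITER
         k = mod (i, NBUF) + 1
         if (rank .eq. 0) then
            call MPI_SEND (pool(1, k), N, MPI_DOUBLE_PRECISION, 1, 0,
     +                     MPI_COMM_WORLD, ierror)
            call MPI_RECV (pool(1, k), N, MPI_DOUBLE_PRECISION, 1, 0,
     +                     MPI_COMM_WORLD, status, ierror)
         else if (rank .eq. 1) then
            call MPI_RECV (pool(1, k), N, MPI_DOUBLE_PRECISION, 0, 0,
     +                     MPI_COMM_WORLD, status, ierror)
            call MPI_SEND (pool(1, k), N, MPI_DOUBLE_PRECISION, 0, 0,
     +                     MPI_COMM_WORLD, ierror)
         end if
      end do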

Page 12: Basic message passing benchmarks, methodology and pitfalls

This can be even worse on CC-NUMA architectures (e.g. SGI Origin 2000):

Operation to be timed: call shmem_get8 (b(1), a(1), n, myrank-1)

(Diagram: processor myrank copies the array a(1..n) from the memory of processor myrank-1 into its local array b. Only the first transfer actually crosses the interconnect; all later transfers are served from the cache of processor myrank.)

Consequence: all but the first transfer are VERY fast!

Page 13: Basic message passing benchmarks, methodology and pitfalls

The problem of parametrization

Goal: instead of displaying a whole performance graph, define a few parameters that characterize the communication behavior.

Most popular parameters:
- latency: time for a 0 byte message
- bandwidth: asymptotic throughput for long messages
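The model implied by these two parameters (a standard formulation, not spelled out on the slide) is a straight line in the message size n:

    T(n) = latency + n / bandwidth

The latency is the intercept of the fitted line, and the bandwidth is the inverse of its slope.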

Page 14: Basic message passing benchmarks, methodology and pitfalls

Problem: this model assumes a very simple communication protocol. It is not consistent with most MPI implementations.

Example: MPI on the NEC Cenju-4 (MPP system, R10000 processors)

(Graph with two straight-line fits to the measured times: the blue fitting gives latency = 8.5 µsec, the red fitting gives latency = 24 µsec.)

Something seems to change at message size = 1024 bytes.
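A model consistent with such a graph (again a standard refinement, not given on the slide) needs one parameter pair per protocol regime, for example:

    T(n) = latency_short + n / bandwidth_short   for n <= 1024 bytes
    T(n) = latency_long  + n / bandwidth_long    for n > 1024 bytes

A single straight-line fit across the switch point mixes the two regimes and reports a latency that no message actually sees.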

Page 15: Basic message passing benchmarks, methodology and pitfalls

It can be even worse: discontinuities in the timing curve.

Example: NEC SX-4, send / receive within a 32 processor node
(Slide shows the corresponding timing graph.)

Page 16: Basic message passing benchmarks, methodology and pitfalls

In most modern MPI implementations there are three different protocols, depending on the message length:

Short messages: the sender writes the message envelope plus the data into pre-defined slots at the receiver; the receiver gets the data from there when it is ready.

"Eager" messages: the sender writes the message envelope into a pre-defined slot and deposits a local copy of the message data in an intermediate buffer; the receiver copies the data into its receive buffer when it is ready.

"Long" messages: the sender writes only the message envelope into a pre-defined slot; sender and receiver then synchronize, and the data is transferred directly from the send buffer to the receive buffer in user memory.

Page 17: Basic message passing benchmarks, methodology and pitfalls

Good reasons for the protocol changes:

Short messages: copying does not cost much; it is important not to block the sender for too long; pre-allocated slots for the intermediate copy avoid the time needed to allocate memory.

"Eager" messages: copying does not cost much; it is important not to block the sender for too long; the intermediate buffer is allocated on the fly.

Long messages: copying is expensive, so intermediate copies are avoided; rendezvous between sender and receiver; synchronous data transfer => optimal throughput.

In the SX-5 implementation: the protocol changes at 1024 and 100,000 bytes.
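Such switch points show up as jumps or slope changes when the ping-pong is swept over message sizes. A sketch (the doubling grid from 8 bytes to 1 MB is an arbitrary choice; it needs an integer nbytes, and buf must be declared large enough, e.g. real*8 buf(131072)):

c     sweep message sizes; print the time per message for each size
      nbytes = 8
 30   continue
         t1 = MPI_WTIME ()
         do i = 1, NITER
            if (rank .eq. 0) then
               call MPI_SEND (buf, nbytes, MPI_BYTE, 1, 0,
     +                        MPI_COMM_WORLD, ierror)
               call MPI_RECV (buf, nbytes, MPI_BYTE, 1, 0,
     +                        MPI_COMM_WORLD, status, ierror)
            else if (rank .eq. 1) then
               call MPI_RECV (buf, nbytes, MPI_BYTE, 0, 0,
     +                        MPI_COMM_WORLD, status, ierror)
               call MPI_SEND (buf, nbytes, MPI_BYTE, 0, 0,
     +                        MPI_COMM_WORLD, ierror)
            end if
         end do
         t2 = MPI_WTIME ()
         if (rank .eq. 0) print *, nbytes, (t2 - t1) / (2 * NITER)
         nbytes = 2 * nbytes
      if (nbytes .le. 1048576) goto 30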

Page 18: Basic message passing benchmarks, methodology and pitfalls

Protocol changes are dangerous for curve fittings!

Famous example: the COMMS1 - COMMS3 benchmarks.

Procurement at NEC: a customer requested a certain message latency, to be measured with COMMS1.

Problem: with the simplistic curve fitting in COMMS1, the SX-4 latency came out too high.

Solution: we made our MPI SLOWER for long messages
=> the tilt of the fitting line changed => a much better fitted latency
=> a happy customer!

Page 19: Basic message passing benchmarks, methodology and pitfalls

The problem of averaging

Problem: it is difficult to measure the time for a single MPI operation:
- no global clock
- clock resolution too low
- difficult to differentiate: receive time vs. synchronization delay

Solution: averaging over many iterations.

Implicit assumption: all iterations take the same time => averaging increases the accuracy.

This is not always true!

Page 20: Basic message passing benchmarks, methodology and pitfalls

Example: the remote get operation shmem_get8 on the Cray T3E

Experiment with a message size of 10 Kbytes:
- 100 measurement runs (tests)
- in each test, ITER calls to shmem_get8 in a loop
- for each test, the time is divided by ITER => time for a single operation

The graph on the next slide shows the results for the 100 tests.

The reported results are courtesy of Prof. Glenn Luecke, Iowa State University, Ames, Iowa, U.S.A.

Page 21: Basic message passing benchmarks, methodology and pitfalls

There are 7 huge spikes in the graph. Probable reason: operating system interruptions. Should those tests be included in the average time?
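A common way to handle such outliers (one possible policy, not prescribed in the talk) is to report the minimum or the median over the tests alongside the mean, since OS interruptions can only add time. A sketch, assuming the per-test times are stored in an array times(1..NTESTS), with NTESTS a hypothetical parameter:

      real*8 times(NTESTS), tmin
      integer j
c     the minimum is the best-case time, unaffected by OS interrupts
      tmin = times(1)
      do j = 2, NTESTS
         if (times(j) .lt. tmin) tmin = times(j)
      end do
c     report tmin alongside the mean so the spikes remain visible
      print *, 'min time per operation:', tmin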

Page 22: Basic message passing benchmarks, methodology and pitfalls

A more subtle example: MPI_Reduce

Let's compare the behavior of two MPI implementations.

Test setup:
- root process plus six child processes
- reduction on an 8 byte real
- time is measured for 1000 iterations:

      Real*8 a, b, t1, t2

      t1 = MPI_WTIME ()
      do i = 1, 1000
         call MPI_REDUCE (a, b, 1, MPI_DOUBLE_PRECISION,
     +                    MPI_MAX, 0, MPI_COMM_WORLD, IERROR)
      end do
      t2 = MPI_WTIME ()

Page 23: Basic message passing benchmarks, methodology and pitfalls

First implementation, details:
- asynchronous algorithm
- senders are not blocked, independent of the receiver status
- low synchronization delay
=> relatively good algorithm

Second implementation, details:
- step 1: synchronize all processes
- step 2: perform the reduction as in the first algorithm
- synchronous algorithm: senders are blocked until the last process is ready
- high synchronization delay
=> relatively bad algorithm

(Diagrams show reduction trees with the root as receiver and the remaining processes as senders.)

Page 24: Basic message passing benchmarks, methodology and pitfalls

What happens at runtime?

(Diagram: reduction tree with the root as receiver and the leaves as senders.)

For the first MPI implementation:
- the leaf processes are fastest, and are not blocked in a single operation
- they quickly go through the loop => messages pile up at the intermediate processors
- the growing message queues slow those processors down
- at some point flow control takes over, but progress is slow

Important to remember: we want to measure the time for ONE reduction operation. We only use loop averaging to increase the accuracy.

This does not happen with the second (bad) MPI implementation! The unnecessary synchronization keeps messages from piling up, so under loop averaging the worse algorithm can come out looking faster.
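One way to keep the loop iterations independent (a common technique, not taken from the talk) is to separate them with a barrier and accumulate only the time of the operation itself; note that the barrier adds its own synchronization, so this measures a different, more controlled scenario. A sketch, with an additional real*8 tsum and the usual integer rank:

c     time each reduction separately; the barrier prevents messages
c     from one iteration from piling up into the next
      tsum = 0.0d0
      do i = 1, 1000
         call MPI_BARRIER (MPI_COMM_WORLD, ierror)
         t1 = MPI_WTIME ()
         call MPI_REDUCE (a, b, 1, MPI_DOUBLE_PRECISION,
     +                    MPI_MAX, 0, MPI_COMM_WORLD, ierror)
         t2 = MPI_WTIME ()
         tsum = tsum + (t2 - t1)
      end do
      if (rank .eq. 0) print *, 'mean time per reduction:', tsum/1000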

Page 25: Basic message passing benchmarks, methodology and pitfalls

Compound Benchmarks

Components:
- MPI_Send / MPI_Recv
- MPI_Bsend / MPI_Recv
- MPI_Isend / MPI_Recv
- MPI_Reduce
- MPI_Bcast

Benchmark driver:
- user interface
- calls the individual components
- reports timings / statistics
- computes effective benchmark parameters

The whole benchmark suite is executed in a single run => elegant and flexible to use.

Can there be any problems? ... Guess what!

Page 26: Basic message passing benchmarks, methodology and pitfalls

In a parallel program it is a good idea not to print results in every process (or the ordering is up to chance).

Alternative: send messages to process 0 and do all the printing there.

But what happens if the set of active processes changes between the benchmark components?

- MPI_Send / MPI_Recv
- MPI_Bsend / MPI_Recv
- MPI_Isend / MPI_Recv
- MPI_Reduce
- MPI_Bcast

Page 27: Basic message passing benchmarks, methodology and pitfalls

We have seen a public benchmark suite in which, in every phase, every process sends a message to process 0: either the timing results for this phase, or the message "I am not active in this phase".

At the same time, process 0 is active in the ping-pong with process 1.

Result: process 0 gets bombarded with "unsolicited messages" during the ping-pong.

Good MPI implementation: handle incoming communication requests as early as possible => good progress.

Bad MPI implementation: handle "unsolicited messages" only if nothing else is to be done => bad progress.

In our case: the bad MPI implementation wins the ping-pong benchmark! (A sketch of a fix follows below.)
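A simple fix (one possible design, not taken from the suite in question) is to keep all results local during the timed phases and collect them with a single gather afterwards; myresults, allresults and NRES are hypothetical names:

c     after all timed phases: collect NRES result values per process
      call MPI_GATHER (myresults, NRES, MPI_DOUBLE_PRECISION,
     +                 allresults, NRES, MPI_DOUBLE_PRECISION,
     +                 0, MPI_COMM_WORLD, ierror)
c     process 0 now prints everything in rank order, and no
c     unsolicited traffic disturbs the measurements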

Page 28: Basic message passing benchmarks, methodology and pitfalls

Summary:

Important to remember when writing an MPI benchmark:

Point-to-point performance is very sensitive to
- the relative timing of send / receive
- the protocol (dependent on message length)
- contention / locks
- cache effects
=> Ping-pong results have very limited value.

There is a trade-off between progress and message latency (invisible in the ping-pong benchmark, but it can be important in real applications).

Parameter fittings usually lead to "bogus" results (they do not take protocol changes into account).

Page 29: Basic message passing benchmarks, methodology and pitfalls

Summary (continued):

Important to remember when writing an MPI benchmark:

Loop averaging is highly problematic:
- variance in single-operation performance becomes invisible
- messages from different iterations may lead to congestion

Keep single benchmarks separate:
- different stages in a suite may interfere with each other (the "unsolicited message" problem)

It should not happen that making MPI slower => better benchmark results!