Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP...
Transcript of Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP...
Experimental Evaluation of BSP Programming Libraries - p. 1/26
Experimental Evaluation of BSP Programming Libraries
Peter Krusche
Department of Computer ScienceCentre for Scientific Computing
University of [email protected]
Introduction
•Outline
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 2/26
Outline
Motivation...Study and compare the communicationcharacteristics and performance predictability of‘BSP-style’ communication libraries.
Outline1. The BSP Model
2. BSP Programming Libraries
3. Benchmarking
4. Performance and Predictability
Introduction
The BSP Model
•The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 3/26
The BSP Model
The BSP model for parallel algorithms was usedwith some slight adaptations.
The model has been adapted here to use secondsinstead of flops as a base unit for the running time:
• p identical processor/memory pairs (computingnodes), computation speed f
• Arbitrary interconnection network, latency l,bandwidth g
Introduction
The BSP Model
•The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 3/26
The BSP Model
M M M M M
Network
P1 P2 Pp...
• Programs are SPMD• Execution takes place in supersteps• Cost Formula : T = f · W + g · H + l · S
• As a base unit for communications, 8-byte doubles will beused
Introduction
The BSP Model
BSP Libraries
•BSP Programming
•The BSPlib Standard
•BSPlib Implementations
•Other Libraries
• ‘BSP-style’ Programming
in MPI
Benchmarking
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 4/26
BSP Programming
‘BSP-style’ programming using a conventionalcommunications library (MPI/Cray shmem/...)• Barrier synchronizations for creating superstep structure
• Many libraries already provide functionality for one sidedcommunication/direct remote memory access (DRMA)
Using a specialized library (The Oxford BSPToolset/PUB/CGMlib/...)• Specialized communication primitives (bulk synchronous
message passing/DRMA)
• Some libraries (Oxford Toolset, PUB) include optimizedbarrier synchronization functions and routing
• Higher level of abstraction
Introduction
The BSP Model
BSP Libraries
•BSP Programming
•The BSPlib Standard
•BSPlib Implementations
•Other Libraries
• ‘BSP-style’ Programming
in MPI
Benchmarking
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 5/26
The BSPlib Standard
Communication primitives:
• DRMA: buffered and unbuffered put, get
• BSMP: send and move
• Synchronization
• Combining and Broadcasting
For the experiments, a BSPlib-style wrapper librarywas created.
Introduction
The BSP Model
BSP Libraries
•BSP Programming
•The BSPlib Standard
•BSPlib Implementations
•Other Libraries
• ‘BSP-style’ Programming
in MPI
Benchmarking
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 6/26
BSPlib Implementations
The Oxford BSP Toolset• Supports 3 kinds of base architecture: message passing,
shared memory, DRMA
• Experiments used message passing MPI interface
• Last release from ’98, compatibility issues on more modernsystems
PUB• Support for message passing and shared memory
architectures
• Experiments used message passing MPI interface
• Additional support for oblivious synchronization, processorgroups
• Less trouble with setup on all systems
• Advanced functionality e.g. for process migration
Introduction
The BSP Model
BSP Libraries
•BSP Programming
•The BSPlib Standard
•BSPlib Implementations
•Other Libraries
• ‘BSP-style’ Programming
in MPI
Benchmarking
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 7/26
Other Libraries
CGMlib• Runs on top of message passing MPI
• Includes set of algorithms for sorting, list ranking, etc.
• Abstract C++ interface
• Lists of abstract datatypes with constant size are used fordata exchange
SSCRAP• Uses MPI (message passing) or Posix (SHMEM) for data
exchange
• Support for DRMA, BSMP, conventional message passing,collective operations, etc.
• ’Soft’ synchronization (send or receive)
• C++ interface
Introduction
The BSP Model
BSP Libraries
•BSP Programming
•The BSPlib Standard
•BSPlib Implementations
•Other Libraries
• ‘BSP-style’ Programming
in MPI
Benchmarking
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 8/26
‘BSP-style’ Programming in MPI
Approach here: a BSPlib style MPI-1 library wasimplemented naively (without message combining,etc.)
• Isend/Recv for data exchange
• Barrier synchronization
• Emulated DRMA on top
Advantage: no overhead for send/putDrawback: high latency, presumably overhead forget operations
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 9/26
Systems used
Measurements on parallel machines at
the Centre for Scientific Computing:
aracari: IBM cluster, 64 × 2-way SMP Pentium31.4 GHz/128 GB of memory(Interconnection Network: Myrinet 2000,MPI: mpich-gm)
argus: Linux cluster, 31 × 2-way SMP Pentium4Xeon 2.6 GHz processors/62 GB of memory(Interconnection Network: 100Mbit Ethernet,MPI: mpich-p4)
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 10/26
Measuring f
Measuring algorithm perfor-mance on one node:
500 1000Matrix Size
0
100 M
200 M
300 M
400 M
500 M
Flo
p/s
BLASIJK loop
Measuring computation timeseparately in one run:
0 500 1000 1500 2000Matrix Size
300 M
350 M
400 M
450 M
500 M
550 M
600 M
Flo
p/s
Actual flopratef = 400 Mflops
(Example for Matrix-Matrix multiplication)
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 11/26
Measuring g and l
Problems encountered: realistic values of g and l depend on
• The number of processors that are used• The communications pattern• The communication volume
E.g. for all-to-all communication on aracari
50 100 150 200 250Number of messages
0
2000
4000
6000
Com
mun
icat
ion
time
[us]
MPI, 4 nodesMPI, 16 nodesOxtool, 4 nodesOxtool, 16 nodes
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 12/26
Bandwidth Surface (aracari)
For a better picture, the effective bandwidth can be sampleddepending on message size and count.
All-to-all communication on aracari:
tim
e p
er e
lem
ent
[us]
m (n
umber
of mes
sages
)
message size
PUB 16 nodesOXTOOL 16 nodes
MPI 16 nodes
2
8
32
128
28
32128
512
0.1
1
10
100
1000
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 12/26
Bandwidth Surface (aracari)
For a better picture, the effective bandwidth can be sampleddepending on message size and count.
All-to-all communication on aracari:
tim
e p
er e
lem
ent
[us]
m (n
umber
of mes
sages
)
message size
PUB 16 nodesOXTOOL 16 nodes
2
8
32
128
28
32128
512
0.1
1
10
100
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 12/26
Bandwidth Surface (aracari)
For a better picture, the effective bandwidth can be sampleddepending on message size and count.
Random permutation on aracari:
tim
e p
er e
lem
ent
[us]
m (n
umber
of mes
sages
)
message size
PUB 16 nodesOXTOOL 16 nodes
MPI 16 nodes
2 8
32 128
512
28
32128
512
0.1 1
10 100
10001e+04
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 12/26
Bandwidth Surface (aracari)
For a better picture, the effective bandwidth can be sampleddepending on message size and count.
Random permutation on aracari:
tim
e p
er e
lem
ent
[us]
m (n
umber
of mes
sages
)
message size
PUB 16 nodesOXTOOL 16 nodes
2 8
32 128
512
28
32128
512
0.1
1
10
100
1000
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 13/26
Bandwidth Surface (argus)
The picture looks different on the slower communicationsnetwork
All-to-all communication on argus:
tim
e p
er e
lem
ent
[us]
m (n
umber
of mes
sages
)
message size
PUB 4 nodesOXTOOL 4 nodes
MPI 4 nodes
2
8
32
128
28
32128
512
1
10
100
1000
1e+04
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 13/26
Bandwidth Surface (argus)
The picture looks different on the slower communicationsnetwork
All-to-all communication on argus:
tim
e p
er e
lem
ent
[us]
m (n
umber
of mes
sages
)
message size
PUB 4 nodesOXTOOL 4 nodes
2
8
32
128
28
32128
512
1
10
100
1000
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 13/26
Bandwidth Surface (argus)
The picture looks different on the slower communicationsnetwork
Random permutation on argus:
tim
e p
er e
lem
ent
[us]
m (n
umber
of mes
sages
)
message size
PUB 4 nodesOXTOOL 4 nodes
MPI 4 nodes
2 8
32 128
512
28
32128
512
0.01 0.1 1
10 100
10001e+04
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 13/26
Bandwidth Surface (argus)
The picture looks different on the slower communicationsnetwork
Random permutation on argus:
tim
e p
er e
lem
ent
[us]
m (n
umber
of mes
sages
)
message size
PUB 4 nodesOXTOOL 4 nodes
2 8
32 128
512
28
32128
512
0.01 0.1 1
10 100
10001e+04
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 14/26
Latency
The latency can be measured for synchronizations precededby different types of communication:
MPI Oxtool PUBaracari 4 processors
l (low) 210 µs 43 µs 39 µsl (high) 230 µs 67 µs 55 µsl (all-to-all) 252 µs 89 µs 72 µs
aracari 32 processors
l (low) 2203 µs 621 µs 142 µsl (high) 2242 µs 638 µs 163 µsl (all-to-all) 2881 µs 1250 µs 750 µs
argus 4 processors
l (low) 5642 µs 796 µs 975 µsl (high) 5789 µs 1442 µs 1176 µsl (all-to-all) 5086 µs 1613 µs 871 µs
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 15/26
Benchmark Summary
Bandwidth depends on message count, messagesize and the communications pattern
On aracari:• Best all-to-all performance: Oxtool
• Best random permutation performance (fewmessages): PUB, > 64 messages: Oxtool
• Best self communication performance (fewmessages): PUB, > 32 messages: Oxtool
• MPI: good performance when message size islarge
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 15/26
Benchmark Summary
Bandwidth depends on message count, messagesize and the communications pattern
On argus:• Best all-to-all performance: PUB
• Random permutation: little difference betweenPUB and Oxtool
• Best self communication performance (fewmessages): Oxtool
• MPI: good performance when message size andcount are larger than 16/32 doubles
Introduction
The BSP Model
BSP Libraries
Benchmarking
•Systems used
•Measuring f
•Measuring g and l
•Bandwidth Surface
(aracari)•Bandwidth Surface
(argus)•Latency
•Benchmark Summary
Performance/Predictions
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 15/26
Benchmark Summary
Latency:
• PUB consistently has best latency (without usingthe faster ‘oblivious’ synchronization)
• As expected, ‘naive’ MPI library has highestlatency
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
•BSP Matrix-Matrix
Multiplication•BSP Matrix-Matrix
Multiplication (2)•Why this algorithm?
•Prediction model
•Prediction results on
aracari•Prediction results, more
processors•Speedup Results (1)
•Speedup Results (2)
•Performance Comparison
with PBLAS
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 16/26
BSP Matrix-Matrix Multiplication
We want to compute the product of two dense n × n matricesA and B
Simple formula:
cik =n
∑
j=1
aijbjk, having A = [aij ] , B = [bij ] , C = [cij]
V
C
A
B
aij
bjk
vijk
cik
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
•BSP Matrix-Matrix
Multiplication•BSP Matrix-Matrix
Multiplication (2)•Why this algorithm?
•Prediction model
•Prediction results on
aracari•Prediction results, more
processors•Speedup Results (1)
•Speedup Results (2)
•Performance Comparison
with PBLAS
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 17/26
BSP Matrix-Matrix Multiplication (2)
Block decomposition into q blocks for memory efficientparallel algorithm:
CIK =
q1/3
∑
J=1
VIJK with I, K = 1, 2, ..., q1/3
V
C
A
B
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
•BSP Matrix-Matrix
Multiplication•BSP Matrix-Matrix
Multiplication (2)•Why this algorithm?
•Prediction model
•Prediction results on
aracari•Prediction results, more
processors•Speedup Results (1)
•Speedup Results (2)
•Performance Comparison
with PBLAS
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 18/26
Why this algorithm?
• Communication block size can be controlled byparameter q
• Message combining has to be used when usingfixed initial data distribution (block-cyclic withblock width n/
√p)
• Can be compared e.g. to PBLAS
• ‘Nice’ version can predistribute the blocks beforethe computation to avoid spikes because of datadistribution
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
•BSP Matrix-Matrix
Multiplication•BSP Matrix-Matrix
Multiplication (2)•Why this algorithm?
•Prediction model
•Prediction results on
aracari•Prediction results, more
processors•Speedup Results (1)
•Speedup Results (2)
•Performance Comparison
with PBLAS
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 19/26
Prediction model
BSP running time:
T = f ·
⌈
q
p
⌉
·n3
q+
g ·
⌈
q
p
⌉
·n2
q2/3·(
2 + 1/q1/3
)
+
l · 2
⌈
q
p
⌉
Two matrices are transferredrow by row→ value of g is taken fromthe red line as value for max-imum matrix size
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
•BSP Matrix-Matrix
Multiplication•BSP Matrix-Matrix
Multiplication (2)•Why this algorithm?
•Prediction model
•Prediction results on
aracari•Prediction results, more
processors•Speedup Results (1)
•Speedup Results (2)
•Performance Comparison
with PBLAS
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 20/26
Prediction results on aracari
Oxtool, using 4 processors
0 500 1000 1500 2000Matrix Size
0
5
10
15
20
Run
ning
tim
e [s
] q=27 mean error 9.89% q=125 mean error 10.60% q=343 mean error 13.06% q=729 mean error 13.56%prediction q=27prediction q=125prediction q=343prediction q=729
p = 4, 1/f = 4e+08 g = 3.5e-07, l = 8.9e-05
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
•BSP Matrix-Matrix
Multiplication•BSP Matrix-Matrix
Multiplication (2)•Why this algorithm?
•Prediction model
•Prediction results on
aracari•Prediction results, more
processors•Speedup Results (1)
•Speedup Results (2)
•Performance Comparison
with PBLAS
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 20/26
Prediction results on aracari
PUB, using 4 processors
0 500 1000 1500 2000Matrix Size
0
10
20
30
40
50
60
Run
ning
tim
e [s
] q=27 mean error 52.68% q=125 mean error 91.92% q=343 mean error 110.39% q=729 mean error 120.03%prediction q=27prediction q=125prediction q=343prediction q=729
p = 4, 1/f = 4e+08 g = 2.5e-06, l = 7.2e-05
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
•BSP Matrix-Matrix
Multiplication•BSP Matrix-Matrix
Multiplication (2)•Why this algorithm?
•Prediction model
•Prediction results on
aracari•Prediction results, more
processors•Speedup Results (1)
•Speedup Results (2)
•Performance Comparison
with PBLAS
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 20/26
Prediction results on aracari
MPI, using 4 processors
0 500 1000 1500 2000Matrix Size
0
5
10
15
20
Run
ning
tim
e [s
] q=27 mean error 4.04% q=125 mean error 4.92% q=343 mean error 7.44% q=729 mean error 8.20%prediction q=27prediction q=125prediction q=343prediction q=729
p = 4, 1/f = 4e+08 g = 3.4e-07, l = 0.00025
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
•BSP Matrix-Matrix
Multiplication•BSP Matrix-Matrix
Multiplication (2)•Why this algorithm?
•Prediction model
•Prediction results on
aracari•Prediction results, more
processors•Speedup Results (1)
•Speedup Results (2)
•Performance Comparison
with PBLAS
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 21/26
Prediction results, more processors
Oxtool, using 32 processors
0 500 1000 1500 2000Matrix Size
0
0.5
1
1.5
2
2.5
3
Run
ning
tim
e [s
] q=343 mean error 20.36% q=512 mean error 31.71% q=729 mean error 34.20%prediction q=343prediction q=512prediction q=729
p = 32, 1/f = 4e+08 g = 5.3e-07, l = 0.0013
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
•BSP Matrix-Matrix
Multiplication•BSP Matrix-Matrix
Multiplication (2)•Why this algorithm?
•Prediction model
•Prediction results on
aracari•Prediction results, more
processors•Speedup Results (1)
•Speedup Results (2)
•Performance Comparison
with PBLAS
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 22/26
Speedup Results (1)
aracari, using 16 processors
0 500 1000 1500 2000Matrix Size
0
5
10
15
20
25
30S
peed
upsequentialOxtool: q=64Oxtool: q=729PUB: q=64PUB: q=729MPI: q=64MPI: q=729
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
•BSP Matrix-Matrix
Multiplication•BSP Matrix-Matrix
Multiplication (2)•Why this algorithm?
•Prediction model
•Prediction results on
aracari•Prediction results, more
processors•Speedup Results (1)
•Speedup Results (2)
•Performance Comparison
with PBLAS
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 23/26
Speedup Results (2)
aracari, using 32 processors (spikes when matrix size mod6 == 0)
0 500 1000 1500 2000Matrix Size
0
10
20
30
40
50S
peed
upsequentialOxtool: q=216Oxtool: q=729PUB: q=216PUB: q=729MPI: q=216MPI: q=729
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
•BSP Matrix-Matrix
Multiplication•BSP Matrix-Matrix
Multiplication (2)•Why this algorithm?
•Prediction model
•Prediction results on
aracari•Prediction results, more
processors•Speedup Results (1)
•Speedup Results (2)
•Performance Comparison
with PBLAS
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 24/26
Performance Comparison with PBLAS
4000 6000 8000 10000Matrix size
0
5 G
10 G
15 G
20 G
flop/
sPBLAS, 36 nodesPBLAS, 16 nodesMPI, 36 nodesMPI, 16 nodesOxtool, 36 nodesOxtool, 16 nodesPUB, 36 nodesPUB, 16 nodes
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
•BSP Matrix-Matrix
Multiplication•BSP Matrix-Matrix
Multiplication (2)•Why this algorithm?
•Prediction model
•Prediction results on
aracari•Prediction results, more
processors•Speedup Results (1)
•Speedup Results (2)
•Performance Comparison
with PBLAS
Conclusion
Experimental Evaluation of BSP Programming Libraries - p. 24/26
Performance Comparison with PBLAS
2000 4000 6000 8000 10000Matrix size
2 G
4 G
6 G
8 G
10 G
12 G
flop/
sPBLAS, 36 nodesPUB, 32 nodesMPI, 32 nodesOxtool, 32 nodes
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
Conclusion
•Summary
•Further Work
Experimental Evaluation of BSP Programming Libraries - p. 25/26
Summary
• Despite restrictions due to BSP model, all implementationsreach good speedup on aracari when number of blocksis low
• Overall benchmark results look better for PUB• Oxtool has best matrix multiplication performance onaracari (Myrinet)
• PUB has best matrix multiplication performance on argus(Ethernet)
• Predictably no real speedup on argus, due to slowcommunications network and fast nodes
• Performance of simple BSP algorithm is comparable withPBLAS
Introduction
The BSP Model
BSP Libraries
Benchmarking
Performance/Predictions
Conclusion
•Summary
•Further Work
Experimental Evaluation of BSP Programming Libraries - p. 26/26
Further Work
• Run experiments on shared memory machine
• Use more different communication patterns forbenchmarking
• Study other algorithms with differentcommunication patterns
• Keeping simplicity, extend prediction model formore accuracy