Scaling Alltoall Collective on Multi-core Systems
(Transcript)
Scaling Alltoall Collective
on Multi-core Systems
Rahul Kumar, Amith R Mamidala,
Dhabaleswar K Panda
Department of Computer Science & Engineering
The Ohio State University
{kumarra, mamidala, panda}@cse.ohio-state.edu
Presentation Outline
•Introduction
•Motivation & Problem Statement
•Proposed Design
•Performance Evaluation
•Conclusion & Future Work
Introduction
Multi-core architectures are widely used for
high-performance computing. The Ranger cluster at TACC has
16 cores/node and more than 60,000 cores in total
Message Passing is the default programming
model for distributed memory systems
MPI provides many communication primitives
MPI Collective operations are widely used in
applications
Introduction
MPI_Alltoall is one of the most communication-intensive
collectives and is widely used in many applications such
as CPMD, NAMD, FFT, and matrix transpose.
In MPI_Alltoall, every process has different
data to send to every other process.
An efficient alltoall is highly desirable for
multi-core systems, as the number of
processes has increased dramatically due
to the low cost per core of multi-core architectures
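The data movement of MPI_Alltoall can be pictured as a transpose of per-process send buffers. A minimal Python sketch of that semantics (a hypothetical model for illustration, not the MPI API itself):

```python
# Toy model of MPI_Alltoall semantics (not the MPI API):
# send_bufs[p][q] is the block process p sends to process q; after the
# exchange, process q holds block q from every sender, in rank order.
def alltoall(send_bufs):
    n = len(send_bufs)
    return [[send_bufs[src][dst] for src in range(n)] for dst in range(n)]

# With 3 processes, process 1 ends up holding everyone's block for rank 1.
bufs = [[(p, q) for q in range(3)] for p in range(3)]
recv = alltoall(bufs)
assert recv[1] == [(0, 1), (1, 1), (2, 1)]
```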
Introduction
24% of the Top500 supercomputers use
InfiniBand as their interconnect (based on
the Nov '07 rankings).
There are several different implementations of
InfiniBand network interfaces:
Offload implementation, e.g. InfiniHost III (3rd-
generation cards from Mellanox)
Onload implementation, e.g. QLogic InfiniPath
Combination of both onload and offload, e.g.
ConnectX from Mellanox
Offload & Onload Architecture
[Figure: offload architecture vs. onload architecture; cores on each node connect through a NIC to the InfiniBand fabric]
In an offload architecture, network processing is offloaded to the network
interface: the NIC sends messages on its own, relieving the CPU of communication work.
In an onload architecture, the CPU performs the communication in addition to
the computation.
In an onload architecture, the faster CPU can speed up the communication;
however, overlapping communication with computation is not possible
Characteristics of various
Network Interfaces
• Some basic experiments were performed on
various network architectures, and the
following observations were made
• The bi-directional bandwidth of onload
network interfaces increases as more
cores are used to push data onto the
network
• This is shown in the following slides
Bi-directional Bandwidth: InfiniPath (onload)
•Bi-directional bandwidth increases as more cores are used to push data
•With an onload interface, more cores help achieve better network utilization
Bi-directional Bandwidth: ConnectX
•A similar trend is also observed for ConnectX network interfaces
Bi-directional Bandwidth: InfiniHost III (offload)
•However, with offload network interfaces the bandwidth drops when more
cores are used
•We attribute this to congestion at the network interface when many
cores transmit simultaneously
Results from the Experiments
• Depending on the interface
implementation, the characteristics
differ
– QLogic onload implementation: using more cores
simultaneously for inter-node communication is
beneficial
– Mellanox offload implementation: using fewer
cores at the same time for inter-node communication
is beneficial
– Mellanox ConnectX architecture: using more cores
simultaneously is beneficial
Presentation Outline
•Introduction
•Motivation & Problem Statement
•Proposed Design
•Performance Evaluation
•Conclusion & Future Work
Motivation
• To evaluate the performance of the existing alltoall
algorithm, we conduct the following experiment
• In the experiment, alltoall time is measured on a
set of nodes
• The number of cores per node participating in
the alltoall is increased gradually
Motivation
•The alltoall time doubles when the number of cores per node is doubled
What is the problem with the
Algorithm?
•Alltoall between two nodes, with one core each, involves one inter-node
communication step per core
[Figure: cores of Node 1 and Node 2 exchanging data]
•With two cores per node, the number of inter-node communications
issued by each core increases to two
•So, on doubling the cores, the alltoall time is almost doubled
•This is exactly what we observed in the previous experiment
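The scaling argument above can be put in numbers with a quick sketch of the slide's cost model (an illustrative model, not a measured quantity):

```python
# Inter-node sends issued by each core in the flat pairwise alltoall,
# assuming every core sends one block to every core on every other node
# (the slide's simplified model).
def internode_sends_per_core(nodes, cores_per_node):
    return (nodes - 1) * cores_per_node

assert internode_sends_per_core(2, 1) == 1  # one core/node: one step
assert internode_sends_per_core(2, 2) == 2  # two cores/node: doubles
```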
Problem Statement
• Can low-cost shared memory help avoid
network transactions?
• Can the performance of alltoall be improved,
especially for multi-core systems?
• Which algorithms should be chosen for different
InfiniBand implementations?
Related Work
There have been studies that propose a leader-
based hierarchical scheme for other collectives
A leader is chosen on each node
Only the leader is involved in inter-node
communication
The communication takes place in three stages:
The cores aggregate data at the leader of the
node
The leader performs inter-node communication
The leader distributes the data to the cores
We implemented the above scheme for alltoall, as
illustrated in the diagram on the next slide
Leader-based Scheme for Alltoall
[Figure: leader-based scheme across Node 0 and Node 1; Step 1 gather at leaders, Step 2 alltoall among leaders, Step 3 scatter to cores]
•Step 1: all cores send their data to the leader
•Step 2: the leaders perform an alltoall among themselves
•Step 3: each leader distributes the respective data to the other cores
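The three stages can be simulated on plain Python lists to check that every block lands where a direct alltoall would put it. This is an illustrative sketch; the rank numbering, node layout, and function name are our assumptions, not the authors' code:

```python
# Simulate the leader-based scheme for `nodes` nodes with C cores each.
# send[s][d] is the block global rank s sends to global rank d; the
# leader of node n is rank n*C. Returns recv with recv[d][s] = send[s][d].
def leader_based_alltoall(send, nodes, C):
    P = nodes * C
    # Stage 1: every core hands its send buffer to its node's leader.
    agg = {n: {} for n in range(nodes)}
    for s in range(P):
        agg[s // C][s] = send[s]
    # Stage 2: leaders exchange the aggregated data (inter-node alltoall).
    gathered = {n: [] for n in range(nodes)}
    for src_node in range(nodes):
        for s, buf in agg[src_node].items():
            for d, block in enumerate(buf):
                gathered[d // C].append((s, d, block))
    # Stage 3: each leader scatters the received blocks to its local cores.
    recv = [[None] * P for _ in range(P)]
    for n in range(nodes):
        for s, d, block in gathered[n]:
            recv[d][s] = block
    return recv
```

With send[s][d] = (s, d), the result matches a direct alltoall: recv[d][s] == (s, d) for all ranks.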
Issues with the Leader-based Scheme
• It uses only one core to send the data out on
the network
• It does not take advantage of the increase in
bandwidth obtained by using more cores to send
data out of the node
Presentation Outline
•Introduction
•Motivation & Problem Statement
•Proposed Design
•Performance Evaluation
•Conclusion & Future Work
Proposed Design
[Figure: cores of Node 0 and Node 1 organized into GROUP 1 and GROUP 2; Step 1 intra-node exchange, Step 2 inter-node alltoall within each group]
•All the cores take part in the communication
•Each core communicates with one and only one core on every other node
•Step 1: Intra-node communication
•The data destined for other nodes is exchanged among the local cores
•The core that communicates with the corresponding core of the
destination node receives the data
•Step 2: Inter-node communication
•An alltoall is called within each group
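The two steps can likewise be sketched as a small simulation (an illustrative model; the rank numbering and function name are our assumptions). In step 1, the block from rank s to rank d moves, within s's node, to the local core whose core index matches d's; in step 2 that core delivers it in a single inter-node hop inside its group:

```python
# Simulate the proposed scheme for `nodes` nodes with C cores each.
# Group g contains core g of every node; inter-node traffic stays in-group.
def proposed_alltoall(send, nodes, C):
    P = nodes * C
    # Step 1: intra-node shuffle -- hand the block for rank d to the local
    # core with core index d % C (the member of d's group on this node).
    staged = {r: [] for r in range(P)}
    for s in range(P):
        for d in range(P):
            owner = (s // C) * C + (d % C)
            staged[owner].append((s, d, send[s][d]))
    # Step 2: alltoall within each group; every block now needs only one
    # inter-node hop, to the same core index on the destination node.
    recv = [[None] * P for _ in range(P)]
    for owner, blocks in staged.items():
        for s, d, block in blocks:
            assert owner % C == d % C  # traffic never leaves the group
            recv[d][s] = block
    return recv
```

As with the leader-based sketch, the final placement matches a direct alltoall, which is the correctness condition for the scheme.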
Advantages of the Proposed
Scheme
• The scheme takes advantage of low-cost
shared memory
• It uses multiple cores to send the data out on
the network, thus achieving better network
utilization
• Each core issues the same number of sends as
the leader in the leader-based scheme, hence
start-up costs are lower
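The start-up-cost argument can be made concrete with a back-of-the-envelope count (illustrative numbers for a hypothetical 64-node, 8-core/node system, not measurements):

```python
# Inter-node sends per sending core under each scheme, in the slide's
# qualitative model (hypothetical system size, not a measured result).
nodes, C = 64, 8
flat = (nodes - 1) * C    # original: one send per remote core
leader = nodes - 1        # leader-based: leader sends once per remote node
proposed = nodes - 1      # proposed: each core sends once per group peer
assert proposed == leader  # same per-core send count as the leader
assert proposed < flat     # far fewer sends than the flat algorithm
```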
Presentation Outline
•Introduction
•Motivation & Problem Statement
•Proposed Design
•Performance Evaluation
•Conclusion & Future Work
Evaluation Framework
• Testbed
– Cluster A: 64 nodes (512 cores)
• dual 2.33 GHz Intel Xeon “Clovertown” quad-core
• InfiniPath SDR network interface QLE7140
• InfiniHost III DDR network interface card MT25208
– Cluster B: 4 nodes (32 cores)
• dual 2.33 GHz Intel Xeon “Clovertown” quad-core
• Mellanox DDR ConnectX network interface
• Experiments
– Alltoall collective time
• Onload InfiniPath network interface
• Offload InfiniHost III network interface
• ConnectX network interface
– CPMD application performance
Alltoall: InfiniPath
[Figure: Alltoall time (us) vs. message size (1 B to 2 KB) for the original, leader-based, and proposed schemes]
•The figure shows the alltoall time for different message sizes on a 512-core system
•The leader-based scheme reduces the alltoall time
•The proposed design gives the best performance on onload network interfaces
Alltoall-InfiniPath: 512 Bytes Message
[Figure: Alltoall time (us) vs. number of nodes (2 to 64) for 512-byte messages: original, leader-based, proposed]
•The figure shows the alltoall time for 512-byte messages on varying system sizes
•The proposed scheme scales much better than the other schemes as the
system size increases
Alltoall: InfiniHost III
[Figure: Alltoall time (us) vs. message size (1 B to 2 KB) on InfiniHost III: original, leader-based, proposed]
•The figure shows the performance of the schemes on offload network interfaces
•The leader-based scheme performs best on the offload NIC, as it avoids congestion
•This matches our expectations
Alltoall: ConnectX
[Figure: Alltoall time (us) vs. message size (1 B to 8 KB) on ConnectX: original, leader-based, proposed]
•As seen earlier, bi-directional bandwidth increases with the use of more
cores on the ConnectX architecture
•Therefore, the proposed scheme attains the best performance
CPMD Application
[Figure: CPMD execution time (sec) on 128 cores for the 32-wat, si63-10ryd, si63-70ryd, and si63-120ryd inputs: original, leader-based, proposed]
•CPMD is designed for ab-initio molecular dynamics and makes
extensive use of alltoall communication
•The figure shows the performance improvement of the CPMD application on
a 128-core system
•The proposed design delivers the best execution time
CPMD Application Performance on
Varying System Size
[Figure: CPMD execution time (secs) vs. system size (8x8 to 64x8): original, leader-based, proposed]
•This figure shows the application execution time on different system sizes
•The reduction in application execution time grows with the system
size; the proposed design scales very well
Presentation Outline
•Introduction
•Motivation & Problem Statement
•Proposed Design
•Performance Evaluation
•Conclusion & Future Work
Conclusion & Future Work
Interfaces implemented for the same interconnect exhibit
different network characteristics.
A single collective algorithm does not perform optimally for all network interfaces.
We have proposed an optimized alltoall collective algorithm for multi-core systems connected using modern InfiniBand network interfaces.
The proposed design reduces MPI_Alltoall time by 55% and speeds up the CPMD application by 33%.
We plan to evaluate our designs on 10GigE-based systems,
and to extend the study to other collectives such as broadcast and allgather.
Web Pointers
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu/
Acknowledgements
Our research is supported by the following organizations
• Current Funding support by
• Current Equipment support by