TORSTEN HOEFLER Remote Memory Access Programming: Faster...
![Page 1: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/1.jpg)
spcl.inf.ethz.ch
@spcl_eth
TORSTEN HOEFLER
Remote Memory Access Programming: Faster Parallel Computing Without Messages
with S. Ramos, R. Gerstenberger, M. Besta, R. Belli @ SPCL
presented at the School of Comp. Science and Engineering, Georgia Tech, June 2015
Motivation & Goals
My dream: provably optimal performance (time and energy)
From problem to machine code
How to get there?
Model-based Performance Engineering!
1. Design a system model
2. Define your problem
3. Find (close-to) optimal solution in model → prove
4. Implement, test, refine if necessary
Will demonstrate techniques & insights
And obstacles
RMA as a solution?
Example: Message Passing, Log(G)P
[Figure: broadcast problem and its optimal solution in the model, CACM 1996]
D. Culler et al.: LogP: A Practical Model of Parallel Computation, Communications of the ACM, Nov. 1996
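The figure for this slide is not part of the transcript. As a rough sketch of what an optimal broadcast in the LogP model can look like, the code below simulates a greedy schedule in which every processor that already holds the data keeps sending to uninformed processors as early as the model allows. The function name and the parameter values in the usage line are illustrative assumptions, not taken from the slide; the parameters follow the usual LogP meaning (L latency, o overhead, g gap, P processors).

```python
import heapq


def logp_broadcast_time(P, L, o, g):
    """Completion time of a greedy single-item broadcast under LogP.

    Assumption: a send started at time s costs the sender o, travels L,
    and costs the receiver o, so the receiver holds the data at s + 2*o + L.
    Consecutive sends from one processor are spaced max(g, o) apart.
    """
    if P <= 1:
        return 0
    send_gap = max(g, o)
    ready = [0]      # earliest time each informed processor can start a send
    finish = 0
    for _ in range(P - 1):
        start = heapq.heappop(ready)          # earliest possible send overall
        arrival = start + o + L + o           # new processor is informed
        finish = max(finish, arrival)
        heapq.heappush(ready, start + send_gap)  # sender's next send slot
        heapq.heappush(ready, arrival)           # receiver starts sending too
    return finish


print(logp_broadcast_time(P=8, L=6, o=2, g=4))  # prints 24
```

The heap always issues the globally earliest send, so the P-1 receivers end up with the P-1 smallest possible arrival times.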
Hardware Reality
POWER 7, 8 cores, source: IBM
Xeon Phi, 64 cores, source: Intel
Interlagos, 8/16 cores, source: AMD
InfiniBand, sources: Intel, Mellanox
BG/Q, Cray Aries, sources: IBM, Cray
Kepler GPU, source: NVIDIA
Example: Cache-Coherent Communication
Source: Wikipedia
Xeon Phi (Rough) Architecture
Ramos, Hoefler: “Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi”, HPDC’13
Local read: R_L = 8.6 ns
Remote read: R_R = 235 ns
Invalid read: R_I = 278 ns
Inspired by Molka et al.: “Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system”
Designing Broadcast Algorithms
Assume a single cache line forms a tree
We choose d levels and k_j children in level j
Reachable threads:
Example: d=2, k_1=3, k_2=2:
c = DTD contention
b = transmit latency
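The formula after "Reachable threads" did not survive in the transcript. A minimal sketch of the count implied by the tree description, assuming the root counts as one thread and every node in level j-1 has k_j children, so that the total is 1 + k_1 + k_1·k_2 + … The function name is hypothetical.

```python
from itertools import accumulate
from operator import mul


def reachable_threads(children_per_level):
    """Threads reached by a tree with k_j children per node in level j.

    Assumption: the root itself is counted, and level j holds
    k_1 * k_2 * ... * k_j nodes.
    """
    level_sizes = accumulate(children_per_level, mul)  # k_1, k_1*k_2, ...
    return 1 + sum(level_sizes)


print(reachable_threads([3, 2]))  # d=2, k_1=3, k_2=2 -> prints 10
```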
Finding the Optimal Broadcast Algorithm
Broadcast example:
[Equation labels: Bcast cost, Number of levels, Reached threads]
Ramos, Hoefler: “Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi”, HPDC’13
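The cost equation itself is missing from the transcript, so the following is only a sketch of the search idea: enumerate tree shapes (d, k_1..k_d), keep those that reach enough threads, and pick the cheapest. The per-level cost c·k_j + b is an assumption made for illustration and may differ from the model in the paper; the function names and search bounds are hypothetical as well.

```python
from itertools import product


def reachable_threads(ks):
    """Root plus all nodes of a tree with ks[j] children per node in level j."""
    total, level = 1, 1
    for k in ks:
        level *= k
        total += level
    return total


def best_tree(n_threads, c, b, max_levels=3, max_children=8):
    """Exhaustively search tree shapes; return (cost, ks) or None."""
    best = None
    for d in range(1, max_levels + 1):
        for ks in product(range(1, max_children + 1), repeat=d):
            if reachable_threads(ks) < n_threads:
                continue  # tree too small to reach every thread
            # ASSUMED cost model: contention c per child plus latency b per level
            cost = sum(c * k + b for k in ks)
            if best is None or cost < best[0]:
                best = (cost, ks)
    return best


print(best_tree(10, c=1.0, b=2.0))  # prints (9.0, (3, 2))
```

Exhaustive search is affordable here because the number of levels and children per level stays small on a single chip.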
Min-Max Modeling
Example:
T0 + T1 write CL
T2 polls for completion
Ramos, Hoefler: “Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi”, HPDC’13
Small Broadcast (8 Bytes)
Ramos, Hoefler: “Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi”, HPDC’13
Barrier
Ramos, Hoefler: “Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi”, HPDC’13
Small Reduction
Ramos, Hoefler: “Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi”, HPDC’13
Lessons learned
Rigorous modeling has large potential
Coming with great cost (working on tool support [1])
Understanding cache-coherent communication performance is incredibly complex (but fun)!
Many states, min-max modeling, NUMA, …
Have models for Sandy Bridge now (QPI, worse!)
Cache coherence really gets in our way here
Complicates modeling and is expensive
Obvious question: why do we need cache coherence?
Answer: well, we don’t, if we program right!
[1]: Calotoiu et al.: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes, SC13
[2]: Gerstenberger et al.: Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided, SC13, Best Paper
COMMUNICATION IN TODAY’S HPC SYSTEMS
The de-facto programming model: MPI-1
Using send/recv messages and collectives
The de-facto network standard: RDMA, SHM
Zero-copy, user-level, OS-bypass, fuzz-bang
MPI-1 MESSAGE PASSING – SIMPLE EAGER
Critical path: 1 latency + 1 copy
[1]: T. S. Woodall, G. M. Shipman, G. Bosilca, R. L. Graham, and A. B. Maccabe, “High performance RDMA protocols in HPC”, EuroMPI’06
MPI-1 MESSAGE PASSING – SIMPLE RENDEZVOUS
Critical path: 3 latencies
[1]: T. S. Woodall, G. M. Shipman, G. Bosilca, R. L. Graham, and A. B. Maccabe, “High performance RDMA protocols in HPC”, EuroMPI’06
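The two critical paths stated on the eager and rendezvous slides (1 latency + 1 copy versus 3 latencies) can be compared with a small back-of-the-envelope model. This is a sketch under stated assumptions: the copy cost is taken to be linear in message size, the payload's wire time is ignored because it appears in both protocols, and the numbers in the usage lines (1000 ns latency, 5 bytes/ns copy bandwidth) are made up for illustration. All function names are hypothetical.

```python
def eager_ns(size_bytes, latency_ns, copy_bytes_per_ns):
    """Eager critical path: one network latency plus one receiver-side copy."""
    return latency_ns + size_bytes / copy_bytes_per_ns


def rendezvous_ns(latency_ns):
    """Rendezvous critical path: three network latencies, no extra copy."""
    return 3 * latency_ns


def crossover_bytes(latency_ns, copy_bytes_per_ns):
    """Message size at which the copy costs as much as the two extra latencies."""
    return 2 * latency_ns * copy_bytes_per_ns


print(eager_ns(64, 1000, 5) < rendezvous_ns(1000))  # small message: prints True
print(crossover_bytes(1000, 5))                     # prints 10000
```

Below the crossover size the extra copy is cheaper than the handshake; above it the handshake wins, which is why implementations typically switch protocols at a size threshold.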
COMMUNICATION IN TODAY’S HPC SYSTEMS
The de-facto programming model: MPI-1
Using send/recv messages and collectives
The de-facto hardware standard: RDMA
Zero-copy, user-level, OS-bypass, fuzz-bang
http://www.hpcwire.com/2006/08/18/a_critique_of_rdma-1/
REMOTE MEMORY ACCESS PROGRAMMING
Why not use these RDMA features more directly?
A global address space may simplify programming
… and accelerate communication
… and there could be a widely accepted standard
MPI-3 RMA (“MPI One Sided”) was born
Just one among many others (UPC, CAF, …)
Designed to react to hardware trends, learn from others
Direct (hardware-supported) remote access
New way of thinking for programmers
“Traditionally, HPC programming models are following hardware developments” (IPDPS’15)
[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
MPI-3 RMA SUMMARY
MPI-3 updates RMA (“MPI One Sided”)
Significant change from MPI-2
Communication is “one sided” (no involvement of destination)
Utilize direct memory access
RMA decouples communication & synchronization
Fundamentally different from message passing
[Figure: two sided (Proc A send, Proc B recv) combines communication + synchronization; one sided (Proc A put, separate sync) keeps communication and synchronization apart]
[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
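To make "RMA decouples communication & synchronization" concrete, here is a toy in-process model, not real MPI. The class `ToyWindow` and its methods are hypothetical stand-ins: `put` plays the role of MPI_Put and `fence` the role of MPI_Win_fence. The model deliberately delays every transfer until the fence to show the guarantee the programmer may rely on; a real implementation is free to deliver data earlier.

```python
class ToyWindow:
    """In-process stand-in for an RMA window (illustration only, not MPI)."""

    def __init__(self, nprocs, size):
        # One exposed memory region per simulated rank.
        self.mem = [[0] * size for _ in range(nprocs)]
        self._pending = []  # puts issued but not yet completed

    def put(self, target, offset, values):
        # Communication only: record the transfer, no handshake with target.
        self._pending.append((target, offset, list(values)))

    def fence(self):
        # Synchronization only: complete every outstanding put.
        for target, offset, values in self._pending:
            self.mem[target][offset:offset + len(values)] = values
        self._pending.clear()


win = ToyWindow(nprocs=2, size=4)
win.put(target=1, offset=0, values=[7, 8])
print(win.mem[1])  # prints [0, 0, 0, 0]: not guaranteed visible yet
win.fence()
print(win.mem[1])  # prints [7, 8, 0, 0]: visible after synchronization
```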
MPI-3 RMA COMMUNICATION OVERVIEW
[Figure: processes A (passive), B, C, D, … (active); memory exposed through MPI windows; operations: Put, Get, Atomic]
Non-atomic communication calls (put, get)
Atomic communication calls (Acc, Get & Acc, CAS, FAO)
Gropp, Hoefler, Thakur, Lusk: Using Advanced MPI
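The semantics of the atomic calls listed above (Acc, Get & Acc, CAS, FAO) can be sketched with a toy single-cell model. This is not MPI: the class `ToyAtomicCell` is hypothetical, a Python lock stands in for whatever the network or MPI library uses to make the update atomic, and the method names only mirror MPI_Accumulate, MPI_Get_accumulate, MPI_Compare_and_swap, and MPI_Fetch_and_op.

```python
import threading
from operator import add


class ToyAtomicCell:
    """One window element with atomic update operations (illustration only)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def accumulate(self, operand, op):
        # Acc: combine operand into the target, nothing is returned.
        with self._lock:
            self._value = op(self._value, operand)

    def get_accumulate(self, operand, op):
        # Get & Acc: like accumulate, but also return the previous value.
        with self._lock:
            old = self._value
            self._value = op(old, operand)
            return old

    def fetch_and_op(self, operand, op):
        # FAO: single-element form of get_accumulate.
        return self.get_accumulate(operand, op)

    def compare_and_swap(self, expected, desired):
        # CAS: store desired only if the target equals expected; return old.
        with self._lock:
            old = self._value
            if old == expected:
                self._value = desired
            return old

    def read(self):
        with self._lock:
            return self._value


def count_with_threads(nthreads, increments):
    """Several 'active' threads increment one 'passive' cell concurrently."""
    cell = ToyAtomicCell(0)

    def worker():
        for _ in range(increments):
            cell.fetch_and_op(1, add)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.read()


print(count_with_threads(4, 1000))  # prints 4000
```

Because each update is atomic, no increment is lost even though the owner of the cell never takes part in the exchange.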
MPI-3 RMA SYNCHRONIZATION OVERVIEW
Active Target Mode: Fence, Post/Start/Complete/Wait
Passive Target Mode: Lock, Lock All
[Figure legend: active process, passive process, synchronization, communication]
Gropp, Hoefler, Thakur, Lusk: Using Advanced MPI
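As a sketch of the active target mode with fences, the toy below runs a ring exchange with Python threads instead of MPI processes. It is an illustration under assumptions: `threading.Barrier` stands in for the collective fence (MPI_Win_fence in real code), a shared list stands in for the window, and the function name is hypothetical.

```python
import threading


def ring_exchange(nprocs):
    """Each simulated rank puts its id into its right neighbour's slot."""
    window = [None] * nprocs            # one exposed slot per "process"
    fence = threading.Barrier(nprocs)   # stands in for a collective fence
    received = [None] * nprocs

    def proc(rank):
        fence.wait()                         # open the access epoch
        window[(rank + 1) % nprocs] = rank   # "put" into the right neighbour
        fence.wait()                         # close the epoch: puts complete
        received[rank] = window[rank]        # now safe to read the own slot

    threads = [threading.Thread(target=proc, args=(r,)) for r in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


print(ring_exchange(4))  # prints [3, 0, 1, 2]
```

All ranks take part in both fences, which is what makes this mode "active target"; in passive target mode only the origin would call lock and unlock.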
![Page 55: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/55.jpg)
spcl.inf.ethz.ch
@spcl_eth
MPI-3 RMA SYNCHRONIZATION OVERVIEW
Active
process
Passive
process
Synchroni-
zation
Passive Target Mode
Lock
Lock All
Active Target Mode
Fence
Post/Start/
Complete/Wait
Communi-
cation
Gropp, Hoelfer, Thakur, Lusk: Using Advanced MPI 56
![Page 56: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/56.jpg)
spcl.inf.ethz.ch
@spcl_eth
MPI-3 RMA SYNCHRONIZATION OVERVIEW
Active
process
Passive
process
Synchroni-
zation
Passive Target Mode
Lock
Lock All
Active Target Mode
Fence
Post/Start/
Complete/Wait
Communi-
cation
Gropp, Hoelfer, Thakur, Lusk: Using Advanced MPI 57
![Page 57: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/57.jpg)
spcl.inf.ethz.ch
@spcl_eth
MPI-3 RMA SYNCHRONIZATION OVERVIEW
Active
process
Passive
process
Synchroni-
zation
Passive Target Mode
Lock
Lock All
Active Target Mode
Fence
Post/Start/
Complete/Wait
Communi-
cation
Gropp, Hoelfer, Thakur, Lusk: Using Advanced MPI 58
![Page 58: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/58.jpg)
spcl.inf.ethz.ch
@spcl_eth
MPI-3 RMA SYNCHRONIZATION OVERVIEW
Active
process
Passive
process
Synchroni-
zation
Passive Target Mode
Lock
Lock All
Active Target Mode
Fence
Post/Start/
Complete/Wait
Communi-
cation
Gropp, Hoelfer, Thakur, Lusk: Using Advanced MPI 59
![Page 59: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/59.jpg)
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
Scalable & generic protocols
Can be used on any RDMA network (e.g., OFED/IB)
Window creation, communication and synchronization
[Figure panels: Window creation, Communication, Synchronization]
Gerstenberger, Besta, Hoefler: Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided, SC13
![Page 60: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/60.jpg)
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
Scalable & generic protocols
Can be used on any RDMA network (e.g., OFED/IB)
Window creation, communication and synchronization
foMPI, a fully functional MPI-3 RMA implementation
DMAPP: lowest-level networking API for Cray Gemini/Aries systems
XPMEM, a portable Linux kernel module
http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
![Page 63: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/63.jpg)
PERFORMANCE INTER-NODE: LATENCY
[Figure: two latency plots, Put Inter-Node and Get Inter-Node; annotations: "20% faster", "80% faster"]
[Benchmark diagram: half ping-pong between Proc 0 and Proc 1; steps: put, sync memory]
Gerstenberger, Besta, Hoefler: Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided, SC13
![Page 64: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/64.jpg)
PERFORMANCE INTRA-NODE: LATENCY
[Figure: Put/Get Intra-Node latency plot; annotation: "3x faster"]
[Benchmark diagram: half ping-pong between Proc 0 and Proc 1; steps: put, sync memory]
Gerstenberger, Besta, Hoefler: Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided, SC13
![Page 65: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/65.jpg)
PART 3: SYNCHRONIZATION
[Figure legend: active process, passive process, synchronization, communication]
Passive Target Mode
Lock
Lock All
Active Target Mode
Fence
Post/Start/Complete/Wait
![Page 66: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/66.jpg)
SCALABLE FENCE PERFORMANCE
[Figure: fence performance plot, annotated with a time bound, a memory bound, and "90% faster"]
Gerstenberger, Besta, Hoefler: Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided, SC13
![Page 67: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/67.jpg)
FLUSH SYNCHRONIZATION
Guarantees remote completion
Performs a remote bulk synchronization and an x86 mfence
One of the most performance-critical functions; we add only 78 x86 CPU instructions to the critical path
[Figure annotations: Time bound, Memory bound]
[Diagram labels: Process 0, Process 2, inc(counter), …, inc(counter), inc(counter); counter: 0]
Gerstenberger, Besta, Hoefler: Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided, SC13
![Page 68: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/68.jpg)
FLUSH SYNCHRONIZATION
Guarantees remote completion
Performs a remote bulk synchronization and an x86 mfence
One of the most performance-critical functions; we add only 78 x86 CPU instructions to the critical path
[Figure annotations: Time bound, Memory bound]
[Diagram labels: Process 0, Process 2, inc(counter), …, inc(counter), inc(counter), flush; counter: 3]
Gerstenberger, Besta, Hoefler: Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided, SC13
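The counter going from 0 to 3 across the flush can be illustrated with a toy model. This is a minimal sketch, not foMPI or MPI code: `ToyWindow`, `inc` and `flush` are hypothetical names, and the model assumes the worst case in which no increment becomes visible at the target before the flush (a real implementation may complete operations earlier).

```python
class ToyWindow:
    """Toy model of a remote counter exposed through an RMA window."""

    def __init__(self):
        self.counter = 0    # value visible at the target
        self.pending = []   # operations issued but not yet remotely complete

    def inc(self):
        # Issue a nonblocking increment; completion is deferred in this model.
        self.pending.append(1)

    def flush(self):
        # Once flush returns, every previously issued operation has been
        # applied at the target (remote completion).
        for delta in self.pending:
            self.counter += delta
        self.pending.clear()


win = ToyWindow()
for _ in range(3):
    win.inc()
before = win.counter   # still 0 in this worst-case model
win.flush()
after = win.counter    # all three increments are now visible
print(before, after)
```

The only guarantee the sketch encodes is the one on the slide: after the flush, all earlier operations are complete at the target.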
![Page 69: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/69.jpg)
PERFORMANCE MODELING
Performance functions for synchronization protocols
Fence
PSCW
Locks
Performance functions for communication protocols
Put/get
Atomics
Gerstenberger, Besta, Hoefler: Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided, SC13
![Page 70: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/70.jpg)
APPLICATION PERFORMANCE
Evaluation on Blue Waters System
22,640 Cray XE6 compute nodes
724,480 schedulable cores
All microbenchmarks
4 applications
One nearly full-scale run
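As a quick consistency check on the two system figures, the cores-per-node count can be derived from them. The per-node value is computed here and is not stated on the slide.

```python
nodes = 22_640    # Cray XE6 compute nodes (from the slide)
cores = 724_480   # schedulable cores (from the slide)

# Derived value: how many schedulable cores each node contributes.
cores_per_node, remainder = divmod(cores, nodes)
print(cores_per_node, remainder)
```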
![Page 71: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/71.jpg)
PERFORMANCE: APPLICATIONS
[Figure: two plots, "NAS 3D FFT [1] Performance" and "MILC [2] Application Execution Time"; scaling annotations: "scale to 512k procs", "scale to 65k procs"]
Annotations represent performance gain of foMPI [3] over Cray MPI-1.
[1] Nishtala et al.: Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS’09
[2] Shan et al.: Accelerating applications at scale using one-sided communication. PGAS’12
[3] Gerstenberger, Besta, Hoefler: Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided, SC13
![Page 72: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/72.jpg)
IN CASE YOU WANT TO LEARN MORE
MPI-3 RMA Summary
Available in most MPI libraries today
Some are even fast!
How to implement producer/consumer in passive mode?
![Page 73: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/73.jpg)
PRODUCER-CONSUMER RELATIONS
Most important communication idiom
Some examples:
Perfectly supported by MPI-1 Message Passing
But how does this actually work over RDMA?
Belli, Hoefler: Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization, IPDPS’15
![Page 77: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/77.jpg)
ONE SIDED – PUT + SYNCHRONIZATION
Critical path: 3 latencies
Belli, Hoefler: Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization, IPDPS’15
![Page 78: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/78.jpg)
COMPARING APPROACHES
Message Passing: 1 latency + copy / 3 latencies
One Sided: 3 latencies
Belli, Hoefler: Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization, IPDPS’15
![Page 79: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/79.jpg)
IDEA: RMA NOTIFICATIONS
First seen in Split-C (1992)
Combine communication and synchronization using RDMA
RDMA networks can provide various notifications
Flags
Counters
Event Queues
Belli, Hoefler: Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization, IPDPS’15
![Page 81: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/81.jpg)
COMPARING APPROACHES
Message Passing: 1 latency + copy / 3 latencies
One Sided: 3 latencies
Notified Access: 1 latency
But how to notify?
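The critical-path comparison can be written as a small cost model. This is a sketch under stated assumptions: the function names are hypothetical, the latency and copy costs are made-up numbers, and reading "1 latency + copy / 3 latencies" as the eager and rendezvous protocols is an interpretation rather than something the slide states.

```python
def message_passing_eager(latency, copy):
    # Assumed eager protocol: one network latency plus a copy at the receiver.
    return latency + copy


def message_passing_rendezvous(latency):
    # Assumed rendezvous protocol: three latencies on the critical path.
    return 3 * latency


def one_sided(latency):
    # Put followed by separate synchronization: three latencies.
    return 3 * latency


def notified_access(latency):
    # Data and notification travel together: one latency.
    return latency


LATENCY_US = 1.0   # hypothetical network latency in microseconds
COPY_US = 0.5      # hypothetical copy cost in microseconds

costs = {
    "message passing (eager)": message_passing_eager(LATENCY_US, COPY_US),
    "message passing (rendezvous)": message_passing_rendezvous(LATENCY_US),
    "one sided": one_sided(LATENCY_US),
    "notified access": notified_access(LATENCY_US),
}
for name, cost in costs.items():
    print(f"{name}: {cost} us")
```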
![Page 82: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/82.jpg)
PREVIOUS WORK: OVERWRITING INTERFACE
Flags (polling at the remote side)
Used in GASPI, DMAPP, NEON
Disadvantages
Location of the flag chosen at the sender side
Consumer needs at least one flag for every process
Polling a high number of flags is inefficient
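The polling cost of the flag scheme can be sketched as follows. This is an illustrative model only: `poll_flags` and the process count are hypothetical, and a plain Python list stands in for consumer-side memory that producers overwrite remotely.

```python
def poll_flags(flags):
    """Scan one flag per producer.

    Returns the ranks whose flag is set and the number of flags inspected.
    """
    ready = [rank for rank, flag in enumerate(flags) if flag]
    return ready, len(flags)


num_producers = 1024              # hypothetical process count
flags = [False] * num_producers   # consumer memory grows with the process count
flags[700] = True                 # producer 700 overwrites its flag

ready, inspected = poll_flags(flags)
print(ready, inspected)
```

Even with a single pending notification, the consumer in this model inspects every flag, which is the inefficiency the slide points out.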
![Page 83: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/83.jpg)
PREVIOUS WORK: COUNTING INTERFACE
Atomic counters (accumulate notifications → scalable)
Used in Split-C, LAPI, SHMEM - Counting Puts, …
Disadvantages
Dataflow applications may require many counters
High polling overhead to identify accesses
Does not preserve order (may not be linearizable)
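Why a single counter cannot identify or order accesses can be seen in a few lines. This is a sketch with a hypothetical helper, `notify_by_counter`; the producer ranks are made up, and the plain increment stands in for an atomic add at the target.

```python
def notify_by_counter(arrival_order):
    """Accumulate one notification per access into a single counter."""
    counter = 0
    for _producer in arrival_order:
        counter += 1   # an atomic add at the target in a real system
    return counter


# Two different arrival orders from hypothetical producers 5, 2 and 9.
first = notify_by_counter([5, 2, 9])
second = notify_by_counter([9, 5, 2])
print(first, second)
```

Both runs leave the same value in the counter, so the consumer learns how many accesses completed but neither which producers they came from nor their order.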
![Page 84: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/84.jpg)
WHAT IS A GOOD NOTIFICATION INTERFACE?
Scalable to yotta-scale
Does memory or polling overhead grow with # of processes?
Computation/communication overlap
Do we support maximum asynchrony? (better than MPI-1)
Complex data flow graphs
Can we distinguish between different accesses locally?
Can we avoid starvation?
What about load balancing?
Ease-of-use
Does it use standard mechanisms?
![Page 85: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/85.jpg)
OUR APPROACH: NOTIFIED ACCESS
Notifications with MPI-1 (queue-based) matching
Retains benefits of previous notification schemes
Poll only head of queue
Provides linearizable semantics
Belli, Hoefler: Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization, IPDPS’15
![Page 86: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/86.jpg)
NOTIFIED ACCESS – AN MPI INTERFACE
Minor interface evolution
Leverages MPI two-sided <source, tag> matching
Wildcard matching with FIFO semantics
Example Communication Primitives
Example Synchronization Primitives
Belli, Hoefler: Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization, IPDPS’15
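The <source, tag> matching with wildcards and FIFO semantics can be modeled with a queue. This is a minimal sketch, not the interface from the paper: `NotificationQueue`, `arrive`, `match` and the `ANY` wildcard are hypothetical names, with `ANY` playing the role that `MPI_ANY_SOURCE` and `MPI_ANY_TAG` play in two-sided MPI.

```python
from collections import deque

ANY = None  # hypothetical wildcard value


class NotificationQueue:
    """Toy consumer-side queue of <source, tag> notifications."""

    def __init__(self):
        self.queue = deque()   # notifications in arrival order

    def arrive(self, source, tag):
        # Called when a notified access from `source` with `tag` completes.
        self.queue.append((source, tag))

    def match(self, source=ANY, tag=ANY):
        """Remove and return the oldest matching notification, or None."""
        for i, (src, tg) in enumerate(self.queue):
            if source in (ANY, src) and tag in (ANY, tg):
                del self.queue[i]
                return (src, tg)
        return None


q = NotificationQueue()
q.arrive(3, 7)
q.arrive(1, 7)
q.arrive(3, 9)

by_source = q.match(source=1)   # oldest notification from rank 1
wildcard = q.match()            # full wildcard: head of the queue
by_tag = q.match(tag=9)         # oldest notification with tag 9
empty = q.match()               # nothing left to match
print(by_source, wildcard, by_tag, empty)
```

With a full wildcard the consumer only ever looks at the head of the queue, and among equally matching notifications the oldest one wins.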
![Page 88: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/88.jpg)
NOTIFIED ACCESS - IMPLEMENTATION
foMPI – a fully functional MPI-3 RMA implementation
Runs on newer Cray machines (Aries, Gemini)
DMAPP: low-level networking API for Cray systems
XPMEM: a portable Linux kernel module
Implementation of Notified Access via uGNI [1]
Leverages uGNI queue semantics
Adds unexpected queue
Uses 32-bit immediate value to encode source and tag
[Figure: Computing Node 1 (Proc A, B, C, D) and Computing Node 2 (Proc E, F, G, H); XPMEM for intra-node communication, DMAPP for inter-node non-notified communication, uGNI for inter-node notified communication]
[1] http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI_NA/
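Packing a source rank and a tag into one 32-bit immediate value might look like the sketch below. The slide only says that the value is 32 bits wide; the 16/16 bit split, the field order and the function names `encode` and `decode` are assumptions made for illustration.

```python
SOURCE_BITS = 16   # assumed field width, not given on the slide
TAG_BITS = 16      # assumed field width, not given on the slide


def encode(source, tag):
    """Pack source and tag into a single 32-bit immediate value."""
    if not (0 <= source < 1 << SOURCE_BITS and 0 <= tag < 1 << TAG_BITS):
        raise ValueError("source or tag does not fit the assumed field width")
    return (source << TAG_BITS) | tag


def decode(immediate):
    """Recover (source, tag) from an immediate value built by encode()."""
    return immediate >> TAG_BITS, immediate & ((1 << TAG_BITS) - 1)


immediate = encode(513, 42)
print(hex(immediate), decode(immediate))
```

Any fixed split trades the number of addressable sources against the number of distinct tags; the real layout may differ.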
![Page 89: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/89.jpg)
EXPERIMENTAL SETTING
Piz Daint
Cray XC30, Aries interconnect
5,272 compute nodes (Intel Xeon E5-2670 + NVIDIA Tesla K20X)
Theoretical Peak Performance 7.787 Petaflops
Peak Network Bisection Bandwidth 33 TB/s
[1] http://www.cscs.ch
![Page 90: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/90.jpg)
PING PONG PERFORMANCE (INTER-NODE)
1000 repetitions, each timed separately, RDTSC timer
95% confidence interval always within 1% of median
[Figure: plot; lower is better]
Belli, Hoefler: Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization, IPDPS’15
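A check like "95% confidence interval within 1% of median" can be computed from the individually timed repetitions. The slides do not say how the interval was obtained; the sketch below uses a common rank-based (nonparametric) approximation, a hypothetical helper `median_ci`, and synthetic timings rather than measurements from the talk.

```python
import math
import random
import statistics


def median_ci(samples, z=1.96):
    """Approximate confidence interval for the median from order statistics."""
    xs = sorted(samples)
    n = len(xs)
    half_width = z * math.sqrt(n) / 2
    lo_rank = max(math.floor(n / 2 - half_width), 1)      # 1-based rank
    hi_rank = min(math.ceil(1 + n / 2 + half_width), n)   # 1-based rank
    return xs[lo_rank - 1], statistics.median(xs), xs[hi_rank - 1]


rng = random.Random(0)
# Synthetic timings in microseconds, one per repetition.
timings = [1.5 + rng.gauss(0.0, 0.02) for _ in range(1000)]

lo, med, hi = median_ci(timings)
within_1_percent = (med - lo) <= 0.01 * med and (hi - med) <= 0.01 * med
print(lo, med, hi, within_1_percent)
```

Timing each repetition separately is what makes this possible: an average over a single timed loop would not expose the distribution.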
![Page 91: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/91.jpg)
COMPUTATION/COMMUNICATION OVERLAP
1000 repetitions, each timed separately, RDTSC timer
95% confidence interval always within 1% of median
Uses communication progression thread
[Figure: plot; lower is better]
Belli, Hoefler: Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization, IPDPS’15
![Page 92: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/92.jpg)
PIPELINE – ONE-TO-ONE SYNCHRONIZATION
1000 repetitions, each timed separately, RDTSC timer
95% confidence interval always within 1% of median
[Figure: plot; lower is better]
[1] https://github.com/intelesg/PRK2
![Page 93: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/93.jpg)
REDUCE – ONE-TO-MANY SYNCHRONIZATION
Reduce as an example (same for FMM, BH, etc.)
Small data (8 Bytes), 16-ary tree
1000 repetitions, each timed separately with RDTSC
[Figure: plot; lower is better]
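The shape of a 16-ary reduction tree can be sketched as follows. The slide gives only the arity; the heap-style rank layout and the names `parent`, `children` and `tree_reduce` are assumptions, and the sequential loop stands in for parents waiting on one notification per child.

```python
ARITY = 16   # from the slide


def parent(rank, arity=ARITY):
    """Parent of `rank` in a heap-style k-ary tree rooted at rank 0."""
    return None if rank == 0 else (rank - 1) // arity


def children(rank, nprocs, arity=ARITY):
    """Ranks that send their partial result to `rank`."""
    first = arity * rank + 1
    return list(range(first, min(first + arity, nprocs)))


def tree_reduce(values, arity=ARITY):
    """Sum values up the tree; each parent combines one value per child."""
    partial = list(values)
    # Children always have higher ranks than their parent in this layout,
    # so walking ranks downward finishes every subtree before it is used.
    for rank in range(len(partial) - 1, 0, -1):
        partial[parent(rank, arity)] += partial[rank]
    return partial[0]


total = tree_reduce(range(273))
print(children(0, 100), children(6, 100), total)
```

In this layout the root of a 100-process job waits for 16 children, which is the one-to-many synchronization pattern the benchmark stresses.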
![Page 94: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/94.jpg)
CHOLESKY – MANY-TO-MANY SYNCHRONIZATION
1000 repetitions, each timed separately, RDTSC timer
95% confidence interval always within 10% of median
[Figure: plot; higher is better]
[1] J. Kurzak, H. Ltaief, J. Dongarra, R. Badia: "Scheduling dense linear algebra operations on multicore processors", CCPE 2010
![Page 95: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/95.jpg)
DISCUSSION AND CONCLUSIONS
Performance of cache coherency is hard to model
Min/max models
RDMA+SHM are de facto hardware mechanisms
Gives rise to RMA programming
MPI-3 RMA standardizes clear semantics
Builds on existing practice (UPC, CAF, ARMCI, etc.)
Rich set of synchronization mechanisms
Notified Access can support producer/consumer
Maintains benefits of RDMA
Fully parameterized LogGP-like performance model
Aids algorithm development and reasoning
applicable at least to:
![Page 96: ORSTEN HOEFLER Remote Memory Access Programming: Faster ...spcl.inf.ethz.ch/Publications/.pdf/hoefler-phi+rma-gatech.pdf · Remote Memory Access Programming: Faster Parallel Computing](https://reader033.fdocuments.net/reader033/viewer/2022050513/5f9d2672b6f87274104e70cd/html5/thumbnails/96.jpg)
ACKNOWLEDGMENTS