Heterogeneous programming with OpenMP
Paolo Burgio (paolo.burgio@unimore.it)
GPGPU Heterogeneous programming
› We need a programming model that provides:
1. Simple and generic offloading subroutines
2. An easy way to write code that runs on thousands of threads (which might be arranged in clusters)
3. A way to exploit the NUMA hierarchy (if any)
[Figure: host with two cores and (host) memory, connected through the chipset and PCI Express to an accelerator]
Offload-based programming
[Figure: the host offloads kernels #0 … #N to the device]
› Move data (to/from) device
› Offload computation
› (Sync)
› OpenMP 4.*
– (OpenMP 5)
Move DATA (and code)
target data constructs
› Map variables to a device data environment for the extent of the region
› The binding set is the generating task
#pragma omp target data [clause [[,]clause]...] new-line
structured-block
Where clauses can be:
if([ target data :] scalar-expression)
device(integer-expression)
map([[map-type-modifier[,]] map-type: ] list)
use_device_ptr(list)
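A minimal sketch of a `target data` region, assuming an OpenMP 4.5 compiler (without offloading support the same code simply runs on the host). The function name `scale_and_offset` and its parameters are hypothetical and only illustrate how two target regions can share one mapping.

```c
/* Hypothetical helper: scale a[0:n] by `factor`, then add `offset`,
 * using two target regions that share one device copy of the array. */
void scale_and_offset(double *a, int n, double factor, double offset)
{
    /* a[0:n] is copied to the device once here and copied back once
     * when the target data region ends. */
    #pragma omp target data map(tofrom: a[0:n])
    {
        /* The array is already present on the device, so this map
         * clause triggers no additional transfer. */
        #pragma omp target map(tofrom: a[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= factor;

        #pragma omp target map(tofrom: a[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] += offset;
    }
}
```

Without the enclosing `target data` region, each `target` construct would move the array to the device and back on its own.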
target enter data / target exit data constructs
› Map variables to a device data environment (enter) or unmap them from it (exit); the mapping is not tied to a structured region
› Stand-alone directive
› The binding set is the generating task
#pragma omp target enter data [clause [[,]clause]...] new-line
#pragma omp target exit data [clause [[,]clause]...] new-line
Where clauses can be:
if([ target enter data | target exit data :] scalar-expression)
device(integer-expression)
map([[map-type-modifier[,]] map-type: ] list)
depend(dependence-type: list)
nowait
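A possible use of the stand-alone `target enter data` and `target exit data` directives, sketched under the assumption of an OpenMP 4.5 compiler. The three function names are hypothetical; the point is that the device copy of the buffer lives across function boundaries.

```c
/* Hypothetical helpers: the device copy of buf[0:n] outlives any single
 * function, which a structured target data region cannot express. */
void buffer_to_device(double *buf, int n)
{
    /* Allocate buf[0:n] on the device and copy the host values in. */
    #pragma omp target enter data map(to: buf[0:n])
}

void square_on_device(double *buf, int n)
{
    /* buf[0:n] is already mapped, so no transfer happens here. */
    #pragma omp target map(tofrom: buf[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        buf[i] *= buf[i];
}

void buffer_from_device(double *buf, int n)
{
    /* Copy the device values back and release the device copy. */
    #pragma omp target exit data map(from: buf[0:n])
}
```

Calling the three functions in this order on a buffer holding 1, 2, 3 leaves 1, 4, 9 in host memory, whether or not an accelerator is present.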
target update construct
› Makes the corresponding list items in the device data environment consistent with their original list items, according to the specified motion clauses
› Stand-alone directive
› The binding set is the generating task
#pragma omp target update [clause [[,]clause]...] new-line
Where clauses can be (at least one motion clause, to or from, is required):
to(list)
from(list)
if([ target update :] scalar-expression)
device(integer-expression)
depend(dependency-type: list)
nowait
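The sketch below shows one way `target update` could be used with its motion clauses `from(list)` and `to(list)` to keep host and device copies consistent in the middle of a `target data` region. The function `relax` and the "reset the first element" step are hypothetical.

```c
/* Hypothetical example: alternate device and host work on the same array. */
void relax(double *a, int n)
{
    /* map(to:) copies a[0:n] in, but does NOT copy it back at the end. */
    #pragma omp target data map(to: a[0:n])
    {
        #pragma omp target map(tofrom: a[0:n])   /* present: no transfer */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= 2.0;

        /* Device -> host: make the host copy see the doubled values. */
        #pragma omp target update from(a[0:n])

        a[0] = 0.0;                              /* host-side change */

        /* Host -> device: push the host-side change to the device. */
        #pragma omp target update to(a[0:n])

        #pragma omp target map(tofrom: a[0:n])   /* present: no transfer */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] += 1.0;

        /* Device -> host: fetch the final values before the region ends. */
        #pragma omp target update from(a[0:n])
    }
}
```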
declare target construct
› Map variables (and functions) to a device data environment
› The binding set is the generating task
#pragma omp declare target new-line
declaration-definition-seq
#pragma omp end declare target new-line
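A small sketch of `declare target`, assuming an OpenMP 4.x compiler; note that the closing directive is spelled `#pragma omp end declare target` in the OpenMP specification. The functions `cube` and `cube_all` are hypothetical.

```c
/* Functions between the two directives are also compiled for the device,
 * so they can be called from inside a target region. */
#pragma omp declare target
double cube(double x)
{
    return x * x * x;
}
#pragma omp end declare target

/* Hypothetical caller: applies cube() to every element on the device. */
void cube_all(double *a, int n)
{
    #pragma omp target map(tofrom: a[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = cube(a[i]);
}
```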
Threading
target construct
› Map variables to a device data environment for the extent of the region
› And execute the region on that device, as a target task
› The binding set is the generating task
#pragma omp target [clause [[,]clause]...] new-line
structured-block
Where clauses can be:
if([ target :] scalar-expression)
device(integer-expression)
private(list)
firstprivate(list)
map([[map-type-modifier[,]] map-type: ] list)
is_device_ptr(list)
defaultmap(tofrom:scalar)
nowait
depend(dependence-type: list)
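As an illustration of the `map` and `if` clauses on a `target` construct, here is a hedged sketch. The name `saxpy` and the `OFFLOAD_THRESHOLD` value are assumptions chosen for the example, not part of the slides.

```c
/* Hypothetical cut-off: below this size the transfer cost is assumed to
 * outweigh the benefit of offloading. */
#define OFFLOAD_THRESHOLD 1024

/* y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(float a, const float *x, float *y, int n)
{
    /* If the if-clause is false, the region executes on the host. */
    #pragma omp target if(n > OFFLOAD_THRESHOLD) \
            map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

`x` is only read, so it is mapped with `to`; `y` is read and written, so it needs `tofrom`.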
teams constructs
› Creates a league of thread teams and the master thread of each team executes the region.
UNLIKE PRAGMA OMP PARALLEL!!
› Executes on device
› Only master thread executes
#pragma omp teams [clause [[,]clause]...] new-line
structured-block
Where clauses can be:
num_teams(integer-expression)
thread_limit(integer-expression)
default(shared|none)
private(list)
firstprivate(list)
shared(list)
reduction(reduction-identifier: list)
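To make the difference from `parallel` concrete, the following sketch counts how many times the body of a `teams` region runs. The function name `count_teams` is hypothetical; the result depends on the implementation, which may create fewer teams than requested.

```c
/* Hypothetical example: returns the number of teams actually created. */
int count_teams(int requested)
{
    int teams = 0;

    #pragma omp target map(tofrom: teams)
    #pragma omp teams num_teams(requested) reduction(+: teams)
    {
        /* Executed once per team, by that team's master thread only.
         * With "parallel", every thread would execute this line. */
        teams += 1;
    }
    return teams;
}
```

The returned value lies between 1 and `requested`; when the pragmas are ignored or the region falls back to a single team, it is 1.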
CUDA-based programming
› Exposed in the programming model
› Based on the concepts of
– Grid(s)
– Block(s)
– Thread(s)
Parallel Programming LM – 2016/17
[Figure: host launching Kernel #0, Kernel #1, … Kernel #N on the device. Labels: Grid #0 with Block(0,0) … Block(2,1); Grid #1 with Block(0) … Block(2); Block (1,1) with Thread(0,0) … Thread(4,2); Block (1,0) with Thread(0) … Thread(4)]
distribute constructs
› Specifies that the iterations of one or more loops will be executed by the thread teams in the context of their implicit tasks.
#pragma omp distribute [clause [[,]clause]...] new-line
for-loops
Where clauses can be:
private(list)
firstprivate(list)
lastprivate(list)
collapse(n)
dist_schedule(kind[, chunk_size])
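A minimal sketch of `distribute` inside a `teams` region, assuming an OpenMP 4.x compiler. The function `double_all` and the parameter `num_blocks` are hypothetical names.

```c
/* Hypothetical example: out[i] = 2 * in[i], with the iteration space
 * split into chunks, one chunk per team. */
void double_all(const int *in, int *out, int n, int num_blocks)
{
    #pragma omp target map(to: in[0:n]) map(from: out[0:n])
    #pragma omp teams num_teams(num_blocks)
    #pragma omp distribute dist_schedule(static)
    for (int i = 0; i < n; i++)
        out[i] = 2 * in[i];   /* run by the master thread of each team */
}
```

With `distribute` alone, only the master thread of each team works on that team's chunk; the other threads of the team stay idle unless a nested `parallel for` is added.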
distribute simd constructs
› Specifies that the iterations of one or more loops will be executed by the thread teams in the context of their implicit tasks.
› The loop will be executed in SIMD fashion (GPGPUs…)
#pragma omp distribute simd [clause [[,]clause]...] new-line
for-loops
Where clauses can be:
private(list)
firstprivate(list)
lastprivate(list)
collapse(n)
dist_schedule(kind[, chunk_size])
distribute parallel for [simd]
› Same as previous, but the iterations are also shared among the threads of each team, so the loop is executed in parallel by threads that belong to multiple teams
WAT??
› Can run on multiple clusters
› …and in SIMD / GPU fashion
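The combined construct can be sketched as follows, assuming an OpenMP 4.x compiler. The function `dot` is a hypothetical example: iterations are first split across teams (`distribute`) and then across the threads of each team (`parallel for`).

```c
/* Hypothetical example: dot product of x[0:n] and y[0:n]. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;

    /* sum must be mapped tofrom so that the reduced value reaches the host. */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n], y[0:n]) map(tofrom: sum) reduction(+: sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

Appending `simd` to the directive additionally asks the compiler to vectorize each thread's share of the iterations.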
…different devices
Intel Xeon Phi:
#pragma omp target device(0) map(tofrom: B)
#pragma omp parallel for
for (int i = 0; i < N; i++)
    B[i] += sin(B[i]);
NVIDIA GP-GPU:
#pragma omp target device(0) map(tofrom: B)
#pragma omp teams num_teams(num_blocks) thread_limit(bsize)
#pragma omp distribute
for (int i = 0; i < N; i += bsize)
    #pragma omp parallel for firstprivate(i)
    for (int b = i; b < i + bsize; b++)
        B[b] += sin(B[b]);
HERO: Open-Source Heterogeneous Research Platform
PULP - An Open Parallel Ultra-Low-Power Processing-Platform
This is a joint project between the Integrated Systems Laboratory (IIS) of ETH Zurich and the Energy-efficient Embedded Systems (EEES) group of UNIBO to develop an open, scalable Hardware and Software research platform with the goal to break the pJ/op barrier within a power envelope of a few mW.
The PULP platform is a multi-core platform achieving leading-edge energy-efficiency and featuring widely-tunable performance.
open-source
cluster-based
scalable
silicon-proven
RISC-V
PULP systems in HPC: Multi-cluster PULP accelerators
Multi-Cluster PULP as an Accelerator
PMCA-managed IOMMU
Main challenges:
• Programmability
• Zero-copy data sharing (pointer passing) via Shared Virtual Memory
HERO’s Hardware Architecture
HERO is modifiable and expandable
▪ Up to 8 clusters, each with 8 cores
▪ All components are open source and written in System Verilog
▪ Standard interfaces (mostly AXI)
▪ New components can easily be added to the memory map
[Figure: HERO hardware architecture. Labels: TLX-400, SoC bus, mailbox, L2 memory, RAB, clusters 0 … L-1 each with L1 memory; inside a cluster: X-Bar interconnect, cluster bus, DMA, L1 SPM banks 0 … M-1, RISC-V PEs 0 … N-1, shared L1 I$, peripheral bus, Per2AXI, AXI2Per, timer, event unit, TRYX, shared APU, DEMUX]
Configuration options annotated in the figure:
▪ # of clusters ∈ {1, 2, 4, 8}
▪ # of PEs ∈ {2, 4, 8}
▪ FPU ∈ {private, shared (APU), off}
▪ integer DSP unit ∈ {private, shared (APU)}
▪ L1 SPM size and # of banks
▪ I$ design, size, # of banks
▪ L2 SPM size
▪ system-level interconnect topology
▪ RAB L1 TLB size and L2 TLB size, associativity, and # of banks
▲ De-facto standard for shared memory programming
▲ Support for nested (multi-level) parallelism → good for clusters
▲ Annotations to incrementally convey parallelism to the compiler → increased ease of use
▲ Based on well-understood programming practices (shared memory, C language) → increases productivity
“OpenCL for programming shared memory multicore CPUs” by Akhtar Ali, Usman Dastgeer, Christoph Kessler
At the moment GCC supports OpenMP offloading ONLY to:
• Intel Xeon Phi
• Nvidia PTX (only through OpenACC)
OpenMP target example
The compiler outlines the code within the target region and generates a binary version for each accelerator (multi-ISA).
The runtime libraries are in charge of:
▪ managing the accelerator devices
▪ mapping the variables
▪ running target regions and waiting for their completion
void vec_mult()
{
    /* v1 and v2 are assumed to be initialized before the target region
       (the initialization is omitted on the slide). */
    double p[N], v1[N], v2[N];
    #pragma omp target map(to: v1, v2) map(from: p)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            p[i] = v1[i] * v2[i];
    }
}
Execution steps for the target region:
1. Initialize the target device
2. Offload the target image
3. Map TO the device memory
4. Trigger execution of the target region
5. Wait for termination
6. Map FROM the device memory
Install and configure the HERO SDK (Linux)
▪ Build the HERO SDK from the sources (takes around 2h)
  ▪ $ git clone --recursive https://github.com/pulp-platform/hero-sdk.git
  ▪ $ cd hero-sdk && ./hero-z-7045-builder -A
▪ Or use the HERO-VM (Ubuntu 16.04 VM): no installation or build needed
  ▪ User: hero-vm (Password: hero)
▪ Configure the environment on the HERO-VM
  ▪ Open Terminal
  ▪ $ cd hero-sdk
  ▪ $ source setup.sh
Install and configure the HERO SDK (Linux) - 2
setup.sh MUST be sourced on every shell you open for HERO development!
Install and configure the HERO SDK (Linux) - 3
▪ HERO Board Setup
  ▪ You do not need to set up the board. It is already available at the IP provided during the class.
▪ Additional Resources
  ▪ HERO SDK Github README (https://github.com/pulp-platform/hero-sdk)
  ▪ HERO Website (https://pulp-platform.org/hero.html)
  ▪ HERO additional HOW-TO (https://pulp-platform.org/hero/doc/)
Let’s say Hello, World!
▪ 1. Write your helloworld.c
▪ 2. Create a Makefile
CSRCS = helloworld.c
CFLAGS=
-include ${HERO_OMP_EXAMPLE_DIR}/../common/default.mk
▪ 3. Build
▪ 4. Open UART from the device. This is required when you want to use print from the accelerator device
  ▪ Open new Terminal
  ▪ $ cd hero-sdk && source setup.sh
  ▪ $ ssh <host>@<board-ip>
  ▪ $ cd /mnt/storage/apps
  ▪ $ ./uart
▪ 5. Run the example!
Some UART characters could be missing…
References
› "HPC" website
– http://algo.ing.unimo.it/people/andrea/Didattica/HPC/
› My contacts
– http://hipert.mat.unimore.it/people/paolob/
› Useful links
– http://www.openmp.org
– http://www.google.com