Parallel Programming Concepts
OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.1: Hardware
Dr. Peter Tröger + Teaching Team
Summary: Week 4
• Accelerators enable major speedup for data parallelism
  ◦ SIMD execution model (no branching)
  ◦ Memory latency managed with many light-weight threads
• Tackle diversity with OpenCL
  ◦ Loop parallelism with index ranges
  ◦ Kernels in C, compiled at runtime
  ◦ Complex memory hierarchy supported
• Getting fast is easy, getting faster is hard
  ◦ Best practices for accelerators
  ◦ Hardware knowledge needed
What if my computational problem still demands more power?
Parallelism for
• Speed: compute faster
• Throughput: compute more in the same time
• Scalability: compute faster / more with additional resources
  ◦ Huge scalability only with shared-nothing systems
  ◦ Still also depends on application characteristics
[Figure: scaling up (more processing elements sharing one main memory) vs. scaling out (adding machines, each with its own main memory)]
Parallel Hardware
• Shared memory system
  ◦ Typically a single machine, common address space for tasks
  ◦ Hardware scaling is limited (power / memory wall)
• Shared-nothing (distributed memory) system
  ◦ Tasks on multiple machines, can only access local memory
  ◦ Global task coordination by explicit messaging
  ◦ Easy scale-out by adding machines to the network
[Figure: shared memory system (tasks on processing elements with caches, one shared memory) vs. shared-nothing system (tasks with local memory per machine, exchanging messages)]
Parallel Hardware
• Shared memory system: a collection of processors
  ◦ Integrated machine for capacity computing
  ◦ Prepared for a large variety of problems
• Shared-nothing system: a collection of computers
  ◦ Clusters and supercomputers for capability computing
  ◦ Installation to solve few problems in the best way
  ◦ Parallel software must be able to leverage multiple machines at the same time
  ◦ Difference to distributed systems (Internet, Cloud):
    • Single organizational domain, managed as a whole
    • Single parallel application at a time, no separation of client and server application
    • Hybrids are possible (e.g. HPC in the Amazon AWS cloud)
Shared Nothing: Clusters
• Collection of stand-alone machines connected by a local network
  ◦ Cost-effective technique for building a large-scale parallel computer
  ◦ Users are builders, have control over their system
  ◦ Synchronization much slower than in shared memory
  ◦ Task granularity becomes an issue
[Figure: tasks with local memory on separate machines, exchanging messages over the network]
Shared Nothing: Supercomputers
• Supercomputers / Massively Parallel Processing (MPP) systems
  ◦ (Hierarchical) cluster with a lot of processors
  ◦ Still standard hardware, but specialized setup
  ◦ High-performance interconnection network
  ◦ For massive data-parallel applications, mostly simulations (weapons, climate, earthquakes, airplanes, car crashes, ...)
• Examples (Nov 2013)
  ◦ Blue Gene/Q, 1.5 million cores, 1.5 PB memory, 17.1 PFlops
  ◦ Tianhe-2, 3.1 million cores, 1 PB memory, 17.8 MW power, 33.86 PFlops (quadrillions of calculations per second)
• Annual ranking with the TOP500 list (www.top500.org)
Example: Blue Gene/Q (© 2011 IBM Corporation)
1. Chip: 16+2 cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for water cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s
Sustained single-node performance: 10x P, 20x L; MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria). Software and hardware support for programming models for exploitation of node hardware concurrency.
Interconnection Networks
• Bus systems
  ◦ Static approach, low costs
  ◦ Shared communication path, broadcasting of information
  ◦ Scalability issues with the shared bus
• Completely connected networks
  ◦ Static approach, high costs
  ◦ Only direct links, optimal performance
• Star-connected networks
  ◦ Static approach with a central switch
  ◦ Fewer links, still very good performance
  ◦ Scalability depends on the central switch
[Figure: processing elements on a shared bus; a completely connected network; a star network with a central switch]
Interconnection Networks
• Crossbar switch
  ◦ Dynamic switch-based network
  ◦ Supports multiple parallel direct connections without collisions
  ◦ Fewer edges than a completely connected network, but still scalability issues
• Fat tree
  ◦ Wider links in the higher parts of the interconnect tree
  ◦ Combines the advantages of a tree design with a solution for root node scalability
  ◦ Communication distance between any two nodes is no more than 2 log(#PEs)
[Figure: crossbar switch connecting PE1..PEn; a fat tree of switches above the processing elements]
Interconnection Networks
• Linear array
• Ring
  ◦ Linear array with connected endings
• N-way D-dimensional mesh
  ◦ Matrix of processing elements
  ◦ Not more than N neighbor links
  ◦ Structured in D dimensions
• N-way D-dimensional torus
  ◦ Mesh with wrap-around connections
[Figure: 4-way 2D mesh, 8-way 2D mesh, and 4-way 2D torus topologies]
Example: Blue Gene/Q 5D Torus
• 5D torus interconnect in the Blue Gene/Q supercomputer
  ◦ 2 GB/s on all 10 links, 80 ns latency to direct neighbors
  ◦ Additional link for communication with I/O nodes
[IBM]
Parallel Programming Concepts
OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.2: Granularity and Task Mapping
Dr. Peter Tröger + Teaching Team
Workload
• Last week showed that task granularity may be flexible
  ◦ Example: OpenCL work group size
• But: communication overhead becomes significant now
  ◦ What is the right level of task granularity?
Surface-To-Volume Effect
• Envision the work to be done (in parallel) as a sliced 3D cube
  ◦ Not a demand on the application data, just a representation
  ◦ Slicing represents splitting into tasks
• Computational work of a task
  ◦ Proportional to the volume of the cube slice
  ◦ Represents the granularity of the decomposition
• Communication requirements of a task
  ◦ Proportional to the surface of the cube slice
• Communication-to-computation ratio
  ◦ Fine granularity: communication high, computation low
  ◦ Coarse granularity: communication low, computation high
Surface-To-Volume Effect
[nicerweb.com]
• Fine-grained decomposition for using all processing elements?
• Coarse-grained decomposition to reduce communication overhead?
• A tradeoff question!
Surface-To-Volume Effect
• Heatmap example with 64 data cells
• Version (a): 64 tasks
  ◦ 64 x 4 = 256 messages, 256 data values
  ◦ 64 processing elements used in parallel
• Version (b): 4 tasks
  ◦ 16 messages, 64 data values
  ◦ 4 processing elements used in parallel
[Foster]
Surface-To-Volume Effect
• Rule of thumb
  ◦ Agglomerate tasks to avoid communication
  ◦ Stop when parallelism is no longer exploited well enough
  ◦ Agglomerate in all dimensions at the same time
• Influencing factors
  ◦ Communication technology + topology
  ◦ Serial performance per processing element
  ◦ Degree of application parallelism
• Task communication vs. network topology
  ◦ The resulting task graph must be mapped to the network topology
  ◦ Task-to-task communication may need multiple hops
[Foster]
The Task Mapping Problem
• Given
  ◦ a number of homogeneous processing elements with performance characteristics,
  ◦ some interconnection topology of the processing elements with performance characteristics,
  ◦ an application dividable into parallel tasks.
• Questions:
  ◦ What is the optimal task granularity?
  ◦ How should the tasks be placed on processing elements?
  ◦ Do we still get speedup / scale-up by this parallelization?
• Task mapping is still research; today it is mostly manual tuning
• More options with configurable networks / dynamic routing
  ◦ Reconfiguration of hardware communication paths
Parallel Programming Concepts
OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.3: Programming with MPI
Dr. Peter Tröger + Teaching Team
Message Passing
• Parallel programming paradigm for shared-nothing environments
  ◦ Implementations for shared memory are available, but typically not the best approach
• Users submit their message passing program & data as a job
• The cluster management system creates the program instances
[Figure: the submission host hands the job (application + data) to the cluster management software, which starts instances 0..3 on the execution hosts]
Single Program Multiple Data (SPMD)
[Figure: the same SPMD program (the MPI ring-communication code shown later in this unit) is started as instances 0 to 4; every instance runs identical code on its share of the input data, with behavior depending only on its rank]
Message Passing Interface (MPI)
• Many optimized messaging libraries for shared-nothing environments, developed by networking hardware vendors
• Need for a standardized API solution: the Message Passing Interface
  ◦ Definition of API syntax and semantics
  ◦ Enables source code portability, not interoperability
  ◦ Software independent from hardware concepts
• Fixed number of process instances, defined on startup
  ◦ Point-to-point and collective communication
• Focus on efficiency of communication and memory usage
• MPI Forum standard
  ◦ Consortium of industry and academia
  ◦ MPI 1.0 (1994), 2.0 (1997), 3.0 (2012)
MPI Communicators
• Each application instance (process) has a rank, starting at zero
• Communicator: handle for a group of processes
  ◦ Unique rank numbers inside the communicator group
  ◦ An instance can determine the communicator size and its own rank
  ◦ Default communicator: MPI_COMM_WORLD
  ◦ An instance may be in multiple communicator groups
[Figure: a communicator of size 4, containing ranks 0 to 3]
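A minimal sketch (my own illustration, not from the slides) of how every MPI program frames its work: each instance initializes the library, queries its rank and the communicator size, and shuts down again.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, comm_size;
    MPI_Init(&argc, &argv);                    /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* own rank within the default communicator */
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size); /* number of instances in the group */
    printf("Instance %d of %d\n", rank, comm_size);
    MPI_Finalize();
    return 0;
}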
Communication
• Point-to-point communication between instances:

  int MPI_Send(void* buf, int count, MPI_Datatype type,
               int destRank, int tag, MPI_Comm com);
  int MPI_Recv(void* buf, int count, MPI_Datatype type,
               int sourceRank, int tag, MPI_Comm com,
               MPI_Status* status);

• Parameters
  ◦ Send / receive buffer + size + data type
  ◦ Sender provides the receiver rank, receiver provides the sender rank
  ◦ Arbitrary message tag
• Source / destination identified by a [tag, rank, communicator] tuple
• Default send / receive will block until the match occurs
• Useful constants: MPI_ANY_TAG, MPI_ANY_SOURCE, MPI_ANY_DEST
• Variations in the API for different buffering behavior
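A minimal usage sketch (my own illustration, assuming rank has already been determined as in the ring example below): rank 0 sends one integer to rank 1, which blocks in MPI_Recv until the matching message with the same tag arrives.

int value = 42;
if (rank == 0) {
    /* tag 7 is arbitrary; the receiver must request the same tag (or MPI_ANY_TAG) */
    MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
} else if (rank == 1) {
    MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Rank 1 received %d\n", value);
}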
Example: Ring communication
// (determine rank and comm_size)
int token;
if (rank != 0) {
    // Receive from your left neighbor if you are not rank 0
    MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, rank - 1);
} else {
    // Set the token's value if you are rank 0
    token = -1;
}
// Send your local token value to your right neighbor
MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size,
         0, MPI_COMM_WORLD);
// Now rank 0 can receive from the last rank.
if (rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, comm_size - 1);
}
[mpitutorial.com]
Deadlocks
Consider:

int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...

If MPI_Send is blocking, there is a deadlock.

int MPI_Send(void* buf, int count, MPI_Datatype type,
             int destRank, int tag, MPI_Comm com);
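One way out (a sketch of my own, not from the slides) is to let rank 1 receive in the same order in which rank 0 sends, so each blocking send finds its matching receive; MPI_Sendrecv offers a combined operation for such symmetric exchanges.

else if (myrank == 1) {
    /* receive tag 1 first, then tag 2 - mirrors the send order of rank 0 */
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
}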
Collective Communication
• Point-to-point communication vs. collective communication
• Use cases: synchronization, data distribution & gathering
• All processes in a (communicator) group communicate together
  ◦ One sender with multiple receivers (one-to-all)
  ◦ Multiple senders with one receiver (all-to-one)
  ◦ Multiple senders and multiple receivers (all-to-all)
• Typical pattern in supercomputer applications
• Participants continue once the group communication is done
  ◦ Always a blocking operation
  ◦ Must be executed by all processes in the group
  ◦ No assumptions on the state of the other participants on return
Barrier
• Communicator members block until everybody reaches the barrier

[Figure: each process calls MPI_Barrier(comm) and proceeds only after the last one has arrived]
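A small usage sketch (my own illustration; do_local_setup is a hypothetical helper): all ranks finish their local setup, synchronize, and only then start a timed phase together.

do_local_setup(rank);             /* hypothetical local work per rank */
MPI_Barrier(MPI_COMM_WORLD);      /* every rank must reach this call, or the others wait forever */
double t0 = MPI_Wtime();          /* timing now starts synchronized across ranks */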
Broadcast
• int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
                int rootRank, MPI_Comm comm)
  ◦ rootRank is the rank of the chosen root process
  ◦ The root process broadcasts the data in buffer to all other processes, itself included
  ◦ On return, all processes have the same data in their buffer
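A short usage sketch (my own illustration): rank 0 determines a parameter and broadcasts it, so every rank sees the same value afterwards.

int steps;
if (rank == 0) {
    steps = 1000;    /* e.g. read from a configuration file on the root */
}
/* every rank calls MPI_Bcast; afterwards 'steps' holds 1000 on all ranks */
MPI_Bcast(&steps, 1, MPI_INT, 0, MPI_COMM_WORLD);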
[Figure: broadcast copies the root's data item D0 into the buffer of every process]
Scatter
• int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
                  void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                  int rootRank, MPI_Comm comm)
  ◦ The sendbuf buffer on the root process is divided, and the parts are sent to all processes, including the root
  ◦ MPI_Scatterv allows a varying count of data per rank
[Figure: scatter distributes the root's items D0..D5 across the processes; gather is the inverse operation]
Gather
• int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
                 void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                 int rootRank, MPI_Comm comm)
  ◦ Each process (including the root process) sends the data in its sendbuf buffer to the root process
  ◦ Incoming data in recvbuf is stored in rank order
  ◦ The recvbuf parameter is ignored for all non-root processes
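A usage sketch (my own illustration, assuming rank and comm_size are already known): every rank contributes one integer, and rank 0 collects them in rank order.

int my_value = rank * rank;   /* something computed locally */
int all[64];                  /* receive buffer, only used on the root (assumes comm_size <= 64) */
MPI_Gather(&my_value, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* on rank 0, all[i] now holds the value sent by rank i */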
Reduction
• int MPI_Reduce(void *sendbuf, void *recvbuf, int count,
                 MPI_Datatype datatype, MPI_Op op,
                 int rootRank, MPI_Comm comm)
  ◦ Similar to MPI_Gather
  ◦ Additional reduction operation op to aggregate the received data: maximum, minimum, sum, product, boolean operators, max-loc, min-loc
• The MPI implementation can overlap communication and reduction calculation for faster results
[Figure: reduce with op = + combines the processes' items D0A, D0B, D0C into D0A + D0B + D0C on the root]
Example: MPI_Scatter + MPI_Reduce
/* -- E. van den Berg 07/10/2001 -- */
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[]) {
    int data[] = {1, 2, 3, 4, 5, 6, 7};   // Size must be >= #processors
    int rank, i = -1, j = -1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Scatter((void *)data, 1, MPI_INT, (void *)&i,
                1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("[%d] Received i = %d\n", rank, i);
    MPI_Reduce((void *)&i, (void *)&j, 1, MPI_INT, MPI_PROD,
               0, MPI_COMM_WORLD);
    printf("[%d] j = %d\n", rank, j);
    MPI_Finalize();
    return 0;
}
What Else
• Variations: MPI_Isend, MPI_Sendrecv, MPI_Allgather, MPI_Alltoall, ...
• Definition of virtual topologies for better task mapping
• Complex data types
• Packing / unpacking (in the style of sprintf / sscanf)
• Group / communicator management
• Error handling
• Profiling interface
• Several implementations available
  ◦ MPICH - Argonne National Laboratory
  ◦ OpenMPI - consortium of universities and industry
  ◦ ...
Parallel Programming Concepts
OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.4: Programming with Channels
Dr. Peter Tröger + Teaching Team
Communicating Sequential Processes
• Formal process algebra to describe concurrent systems
  ◦ Developed by Tony Hoare at the University of Oxford (1977)
    • Also the inventor of Quicksort and Hoare logic
  ◦ Computer systems act and interact with their environment
  ◦ Decomposition into subsystems (processes) that operate concurrently inside the system
  ◦ Processes interact with other processes, or with the environment
• Book: C. A. R. Hoare, Communicating Sequential Processes, 1985
• A mathematical theory, described with algebraic laws
• The CSP channel concept is available in many programming languages for shared-nothing systems
• The complete approach is implemented in the Occam language
CSP: Processes
• The behavior of real-world objects can be described through their interaction with other objects
  ◦ Leave out internal implementation details
  ◦ The interface of a process is described as a set of atomic events
• Example: ATM and user, both modeled as processes
  ◦ card - insertion of a credit card into the ATM card slot
  ◦ money - extraction of money from the ATM dispenser
• Alphabet - the set of relevant events for an object description
  ◦ Events outside the alphabet may never happen; interaction is restricted to these events
  ◦ αATM = αUser = {card, money}
• A CSP process is the behavior of an object, described with its alphabet
Communication in CSP
• Special class of event: communication
  ◦ Modeled as a unidirectional channel between processes
  ◦ The channel name is a member of the alphabets of both processes
  ◦ Send activity is described by multiple c.v events (channel c, value v)
• The channel approach assumes rendezvous behavior
  ◦ Sender and receiver block on the channel operation until the message is transmitted
  ◦ Implicit barrier based on communication
• With this formal foundation, mathematical proofs are possible
  ◦ When two concurrent processes communicate with each other only over a single channel, they cannot deadlock.
  ◦ A network of non-stopping processes that is free of cycles cannot deadlock.
  ◦ ...
What's the Deal?
• Any possible system can be modeled through event chains
  ◦ Enables mathematical proofs of deadlock freedom, based on the basic assumptions of the formalism (e.g. the single channel assumption)
• Some tools available (check the readings page)
• CSP was the formal base for the Occam language
  ◦ Language constructs follow the formalism
  ◦ Mathematical reasoning about the behavior of written code
• Still active research (Welsh University); the channel concept is frequently adopted
  ◦ CSP channel implementations for Java, MPI, Go, C, Python
  ◦ Other formalisms based on CSP, e.g. the Task/Channel model
Channels in Scala
Scope-based channel sharing:

actor {
  var out: OutputChannel[String] = null
  val child = actor {
    react {
      case "go" => out ! "hello"
    }
  }
  val channel = new Channel[String]
  out = channel
  child ! "go"
  channel.receive {
    case msg => println(msg.length)
  }
}

Sending channels in messages:

case class ReplyTo(out: OutputChannel[String])

val child = actor {
  react {
    case ReplyTo(out) => out ! "hello"
  }
}

actor {
  val channel = new Channel[String]
  child ! ReplyTo(channel)
  channel.receive {
    case msg => println(msg.length)
  }
}
Channels in Go
package main

import "fmt"

func sayHello(ch1 chan string) {
	ch1 <- "hello" // send into the channel; blocks until someone receives
}

func main() {
	ch1 := make(chan string) // unbuffered channel
	go sayHello(ch1)         // run the sender concurrently
	fmt.Println(<-ch1)       // receive blocks until a value arrives
}
Channels in Go
• The select concept allows switching between the available channels
  ◦ All channels are evaluated
  ◦ If multiple can proceed, one is chosen randomly
  ◦ Default clause if no channel is available
• Channels are typically first-class language constructs
  ◦ Example: the client provides a response channel in the request
• Popular solution to get deterministic behavior
Example (with two assumed channels ch1 and ch2):

select {
case v := <-ch1:
	fmt.Println("received from ch1:", v)
case v := <-ch2:
	fmt.Println("received from ch2:", v)
default:
	fmt.Println("no channel ready")
}
Task/Channel Model
• Computational model for multi-computers, by Ian Foster
• Similar concepts to CSP
• A parallel computation consists of one or more tasks
  ◦ Tasks execute concurrently
  ◦ The number of tasks can vary during execution
  ◦ Task: a serial program with local memory
  ◦ A task has in-ports and out-ports as its interface to the environment
  ◦ Basic actions: read / write local memory, send a message on an out-port, receive a message on an in-port, create a new task, terminate
Task/Channel Model
• Out-port / in-port pairs are connected by channels
  ◦ Channels can be created and deleted
  ◦ Channels can be referenced as ports, which can be part of a message
  ◦ The send operation is non-blocking
  ◦ The receive operation is blocking
  ◦ Messages in a channel stay in order
• Tasks are mapped to physical processors by the execution environment
  ◦ Multiple tasks can be mapped to one processor
• Data locality is an explicit part of the model
• Channels can model control and data dependencies
Programming With Channels
• Channel-only parallel programs have advantages
  ◦ Performance optimization does not influence the semantics
    • Example: shared-memory channels for some parts
  ◦ Task mapping does not influence the semantics
    • Align the number of tasks with the problem, not with the execution environment
    • Improves scalability of the implementation
  ◦ Modular design with well-defined interfaces
• Communication should be balanced between tasks
• Each task should only communicate with a small group of neighbors
Parallel Programming Concepts
OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.5: Programming with Actors
Dr. Peter Tröger + Teaching Team
Actor Model
• Carl Hewitt, Peter Bishop and Richard Steiger: "A Universal Modular Actor Formalism for Artificial Intelligence", IJCAI 1973
  ◦ Mathematical model for concurrent computation
  ◦ The actor as the computational primitive
    • Makes local decisions, concurrently sends / receives messages
    • Has a mailbox for incoming messages
    • Concurrently creates more actors
  ◦ Asynchronous one-way message sending
  ◦ Changing topology allowed, typically no order guarantees
  ◦ The recipient is identified by a mailing address
    • Actors can send their own identity to other actors
• Available as a programming language extension or library in many environments
Erlang - Ericsson Language
• Functional language with actor support
• Designed for large-scale concurrency
  ◦ First version in 1986 by Joe Armstrong, Ericsson Labs
  ◦ Available as open source since 1998
• Language goals driven by Ericsson product development
  ◦ Scalable distributed execution of phone call handling software with a large number of concurrent activities
  ◦ Fault-tolerant operation under timing constraints
  ◦ Online software update
• Users
  ◦ Amazon EC2 SimpleDB, Delicious, Facebook chat, T-Mobile SMS and authentication, Motorola call processing, Ericsson GPRS and 3G mobile network products, CouchDB, ejabberd, ...
Concurrency in Erlang
• Concurrency Oriented Programming
  ◦ Actor processes are completely independent (shared nothing)
  ◦ Synchronization and data exchange with message passing
  ◦ Each actor process has an unforgeable name
  ◦ If you know the name, you can send a message
  ◦ Default approach is fire-and-forget
  ◦ You can monitor remote actor processes
• Using this gives you
  ◦ The opportunity for massive parallelism
  ◦ No additional penalty for distribution, despite latency issues
  ◦ Easier fault tolerance capabilities
  ◦ Concurrency by default
Actors in Erlang
• Communication via message passing is part of the language
• Send never fails and works asynchronously (PID ! Message)
• Actors have mailbox functionality
  ◦ Queue of received messages, selective fetching
  ◦ Only messages from the same source arrive in order
  ◦ receive statement with a set of clauses, pattern matching
  ◦ The process is suspended in the receive operation until a match occurs

receive
    Pattern1 when Guard1 -> expr1, expr2, ..., expr_n;
    Pattern2 when Guard2 -> expr1, expr2, ..., expr_n;
    Other -> expr1, expr2, ..., expr_n
end
Erlang Example: Ping Pong Actors
[erlang.org]
-module(tut15).
%% Functions exported + number of arguments
-export([test/0, ping/2, pong/0]).

%% Ping actor, sending messages to Pong
ping(0, Pong_PID) ->
    Pong_PID ! finished,
    io:format("Ping finished~n", []);
ping(N, Pong_PID) ->
    Pong_PID ! {ping, self()},
    %% Blocking receive, scanning the mailbox
    receive
        pong ->
            io:format("Ping received pong~n", [])
    end,
    ping(N - 1, Pong_PID).

%% Pong actor: blocking recursive receive, sending messages back to Ping
pong() ->
    receive
        finished ->
            io:format("Pong finished~n", []);
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            Ping_PID ! pong,
            pong()
    end.

%% Start the Ping and Pong actors
test() ->
    Pong_PID = spawn(tut15, pong, []),
    spawn(tut15, ping, [3, Pong_PID]).
Actors in Scala
• Actor-based concurrency in Scala, similar to Erlang
• Concurrency abstraction on top of threads or processes
• Communication by a non-blocking send operation and a blocking receive operation with matching functionality

actor {
  var sum = 0
  loop {
    receive {
      case Data(bytes)       => sum += hash(bytes)
      case GetSum(requester) => requester ! sum
    }
  }
}

• All constructs are library functions (actor, loop, receive, ...)
• Alternative self.receiveWithin() call with a timeout
• Case classes act as the message type representation
Scala Example: Counter Actor
import scala.actors.Actor
import scala.actors.Actor._

// Case classes, acting as message types
case class Inc(amount: Int)
case class Value

class Counter extends Actor {
  var counter: Int = 0
  def act() = {
    while (true) {
      receive {
        case Inc(amount) =>
          counter += amount
        case Value =>
          println("Value is " + counter)
          exit()
      }
    }
  }
}

object ActorTest extends Application {
  // Start the counter actor
  val counter = new Counter
  counter.start()
  for (i <- 1 to 1000)
    counter ! Inc(1)
  counter ! Value
}
Actor Deadlocks
• A synchronous send operator !? is available in Scala
  ◦ Sends a message and blocks in a receive afterwards
  ◦ Intended for the request-response pattern
• The original asynchronous send makes deadlocks less probable
[http://savanne.be/articles/concurrency-in-erlang-scala/]

Deadlock with synchronous sends:

// actorA
actorB !? Msg1(value) match {
  case Response1(r) => // ...
}
receive {
  case Msg2(value) => reply(Response2(value))
}

// actorB
actorA !? Msg2(value) match {
  case Response2(r) => // ...
}
receive {
  case Msg1(value) => reply(Response1(value))
}

No deadlock with asynchronous sends:

// actorA
actorB ! Msg1(value)
while (true) {
  receive {
    case Msg2(value) => reply(Response2(value))
    case Response1(r) => // ...
  }
}

// actorB
actorA ! Msg2(value)
while (true) {
  receive {
    case Msg1(value) => reply(Response1(value))
    case Response2(r) => // ...
  }
}
Parallel Programming Concepts
OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.6: Programming with MapReduce
Dr. Peter Tröger + Teaching Team
MapReduce
• Programming model for parallel processing of large data sets
  ◦ Inspired by map() and reduce() in functional programming
  ◦ Intended for best scalability in data parallelism
• Huge interest started with the Google Research publication
  ◦ Jeffrey Dean and Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"
  ◦ Google products rely on an internal implementation
• Apache Hadoop: widely known open source implementation
  ◦ Scales to thousands of nodes
  ◦ Has been shown to process petabytes of data
  ◦ Cluster infrastructure with a custom file system (HDFS)
• Parallel programming on a very high abstraction level
MapReduce Concept
• Map step
  ◦ Converts input tuples [key, value] with a map() function into one or multiple intermediate tuples [key2, value2] per input
• Shuffle step: collects all intermediate tuples with the same key
• Reduce step
  ◦ Combines all intermediate tuples with the same key into one result per key, using some reduce() function
• The developer just defines the stateless map() and reduce() functions
• The framework automatically ensures parallelization
• A persistence layer is needed for input and output only
[developers.google.com]
Example: Character Counting
Java Example: Hadoop Word Count
public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  ...
}
[hadoop.apache.org]
MapReduce Data Flow
[developer.yahoo.com]
Advantages
• The developer never implements communication or synchronization; it is done implicitly by the framework
  ◦ Allows transparent fault tolerance and optimization
• Running map and reduce tasks are stateless
  ◦ They only rely on their input and produce their own output
  ◦ Repeated execution in case of failing nodes
  ◦ Redundant execution to compensate for nodes with different performance characteristics
• Scale-out is only limited by
  ◦ Distributed file system performance (input / output data)
  ◦ Shuffle step communication performance
• Chaining of map/reduce tasks is very common in practice
• But: demands an embarrassingly parallel problem
Summary: Week 5
• Shared-nothing systems provide very good scalability
  ◦ Adding new processing elements is not limited by the power / memory walls
  ◦ Different options for interconnect technology
• Task granularity is essential
  ◦ Surface-to-volume effect
  ◦ Task mapping problem
• The de-facto standard is MPI programming
• Higher-level abstractions with
  ◦ Channels
  ◦ Actors
  ◦ MapReduce
What steps / strategy would you apply to parallelize a given compute-intense program?