Topics in Advanced Computing
Miguel Afonso Oliveira
Laboratório de Instrumentação e Física Experimental de Partículas
LIP
Instituto de Telecomunicações - Polo de Coimbra
Outline
Part I: Computing
Part II: Platforms
Part III: Design
Part IV: Efficiency
Part V: Performance
Part VI: OpenMP
Part VII: More OpenMP
Part VIII: Intro to MPI
Part IX: MPI
Part X: More MPI
Part XI: Hybrid
Part XII: Development
Part XIII: Libraries
Part XIV: Compilers
Part XV: Debug & Prof
Part XVI: Resources
Part I: Introduction to Parallel Computing
1 Science and Computers: Classical Scientific Method; Modern Scientific Method; The Grand Challenges in Science
2 Motivating Parallelism: Computational Power Argument; Memory/Disk Speed Argument; The Data Communication Argument
3 Parallel Computing: What is Parallel Computing? Why do Parallel Computing?
4 Message
Part II: Parallel Programming Platforms
5 Parallel Computers: Flynn's Taxonomy: Early days; Flynn's Taxonomy: Recently; Memory model Taxonomy
6 Parallel Programming: The Two Extreme Models; Parallel Programming: The Real World
Part III: Principles of Parallel Algorithm Design
7 The Task/Channel Model
8 Foster’s Design Methodology
9 Message
Part IV: Computer Architecture and Efficiency
10 Introduction
11 Memory Hierarchy: Introduction; Caches; Virtual Memory
12 Designing for memory hierarchies: Introduction; Temporal locality (Inline small functions, if-then-else, compact loops); Spatial locality (Loop index order, Blocking, Block reordering, Dynamic data structures)
13 Loop unrolling
14 Library usage
Part V: Performance Analysis
15 Introduction
16 Speedup and Efficiency
17 Amdahl’s law
18 Gustafson-Barsis’ Law
19 Summary on speedup limits
20 The Karp-Flatt metric
Part VI: OpenMP
21 Introduction: Definition; Advantages and Disadvantages of OpenMP; Architecture; Components; Execution Model; Thread Communication
22 OpenMP: Syntax and Sentinels; Parallel Control Directives; Combined Directives; Data Environment; Directive Clauses
Part VII: More OpenMP
23 Synchronisation
24 Runtime libraries
25 Environment variables
26 Compiling and Running an OpenMP programme: Compiling; Running
Part VIII: Introduction to MPI
27 MPI
28 Definition
29 Advantages and Disadvantages
30 Fundamentals
31 Basic MPI
32 Example
33 Compiling and Running an MPI program: Compiling; Running
Part IX: MPI
34 Collective Communication: Synchronization; Data Movement; Advanced Data Movement Primitives; Data movement with computation
35 Point to Point Communication: Deadlock; Unidirectional point to point communications; Bidirectional point to point communications; Avoid Deadlock
Part X: More MPI
36 Derived Datatypes: Basic Types; Derived Datatypes
37 Communicators and Groups
38 Topologies
39 Wildcards
40 Timing
41 MPI I/O: Introduction; I/O types in MPI programs; Parallel MPI I/O to a single file
Part XI: Hybrid programming
42 Hybrid Programming
43 Distributed and Shared Memory Systems
44 Why Hybrid Computing
45 Modes of Hybrid Computing: Tasks and Threads; Hybrid Coding; Types; MPI Initialization; Funneled Mode; Serialized Mode; Multi-Threaded Mode
46 Overlapping Communication and Work
47 Thread-rank Communication
Part XII: Introduction to Software Development Tools
48 Introduction to Software Development Tools
Part XIII: Libraries
49 Libraries: Why Parallel Libraries?; Performance Libraries; Optimized Libraries; Common HPC Libraries
Part XIV: Compiler Optimization
50 Compiler Optimization: Introduction; Optimization Levels; Intel Compiler Options; PGI Compiler Options; Conclusions
Part XV: Debugging and Profiling
51 Debugging and Profiling: Standard Debuggers; Debugging Basics; Commercial Debuggers
52 Analysis Tools: Goals; Compiler Reports & Listings; Timers; Profilers; Hardware Performance Counters
Part XVI: HPC Available Resources
53 HPC Available Resources: Portugal; Europe
Part I
Introduction to Parallel Computing
Science and Computers
Classical Science
[Figure: the classical scientific method - Nature, Observation, Theory, Experimentation]
Modern Science
[Figure: the modern scientific method - Nature, Observation, Theory, Experimentation, Numerical Simulation]
The Grand Challenges in Science
Quantum Chemistry, Statistical Mechanics and Relativistic Physics
Cosmology and Astrophysics
Computational Fluid Dynamics and Turbulence
Material Design and Superconductivity
Biology, Pharmacology, Genome Sequencing, Genetic Engineering, Protein Folding, Enzyme Activity and Cell Modeling
Medicine and Modeling of Human Organs and Bones
Global Weather and Environmental Modeling
Motivating Parallelism
Computational Power Argument
Moore's Law: Circuit complexity doubles every eighteen months.
Generalized Moore's Law: The amount of computing power available at a given cost doubles every eighteen months.
Critical Issue:
How do we translate transistors into useful operations per second? How do we use the transistors to achieve increasing rates of computation?
Logical Answer:
Rely on parallelism (implicit and explicit).
Memory/Disk Speed Argument
The gap between processor speed and memory/disk speed presents a tremendous performance bottleneck.
The usual compromise is to use a hierarchy of caches...
Parallelism can also help because it yields better memory/disk system management:
larger aggregate caches;
higher aggregate bandwidth.
The Data Communication Argument
Networking infrastructure has evolved tremendously.
Why not use it as a heterogeneous parallel/distributed computing environment?
Voluntary Computing
SETI@HOME
IBERCIVIS...
Grid Computing
WLCG...
Parallel Computing
What is Parallel Computing?
Parallel Computing: the use of multiple processing units or computers for a common task.
Each processing unit works on its section of the problem.
Processing units can exchange information.
[Figure: four processing units, PU_1 to PU_4, each working on its own area of the problem]
Why do Parallel Computing?
To compute beyond the limits of single PU systems:
achieve more performance;
utilize more memory.
To be able to:
solve problems that can't be solved in a reasonable time with single PU systems;
solve problems that don't fit on a single PU system or even a single system.
So we can:
solve larger problems;
solve problems faster;
solve more problems.
Message
If you can't compute you can't compete!
Part II
Parallel Programming Platforms
Parallel Computers
Early days: Flynn’s Taxonomy
Recently: Flynn’s Taxonomy
[Figure: MIMD divided into SPMD (Single Program Multiple Data) and MPMD (Multiple Program Multiple Data)]
Memory Model Taxonomy
[Figure: parallel computers split into Shared Memory (CPU_0 ... CPU_N sharing one Memory through an Interconnect) and Distributed Memory (CPU_0 ... CPU_N, each with its own Mem_0 ... Mem_N, connected by an Interconnect)]
Parallel Programming
The Two Extreme Parallel Programming Models
[Figure: shared memory computers map to Shared Memory Programming (OpenMP); distributed memory computers map to Message-Passing Programming (MPI)]
The distributed model can be used directly on a shared memory system.
Using the shared memory model on a distributed memory system is only possible indirectly.
Both models can be combined to optimize performance.
Parallel Computers and Parallel Programming: The Real World
[Figure: the real-world landscape - hardware ranging from homogeneous systems (shared memory, Myrinet/Infiniband clusters, RDMA) to heterogeneous systems (CPU+GPU, CPU+Coprocessor, CPU+FPGA); memory models including shared memory, distributed memory and distributed/partitioned global address spaces (DGAS/PGAS); and software layers from operating-system threads to libraries and languages such as OpenMP, MPI, PVM, CAF, HPF, UPC, Chapel, IBM X10, SUN Fortress, CUDA and OpenCL]
Part III
Principles of Parallel Algorithm Design
The Task/Channel Model
Parallel Computation: a set of tasks interacting with each other by sending messages through channels.
Task: a program, its local memory and a collection of I/O ports.
Communication: tasks can send local data values to other tasks via output ports and receive data values from other tasks via input ports.
Channel: a message queue connecting one task's output port to another task's input port.
Synchronicity: message reception is synchronous (the receiver must wait for the message to arrive - blocking). Sending is asynchronous (the sender never blocks).
Foster’s Design Methodology
A four-step process for designing parallel algorithms:
Partitioning: the process of dividing the computation and the data into primitive tasks. Can be done in a data-centric approach (domain decomposition) or a computation-centric approach (functional decomposition).
Communication: determining the communication pattern between the primitive tasks and identifying local and global communication.
Agglomeration: the process of grouping tasks into larger tasks in order to improve performance, simplify programming and lower communication overhead.
Mapping: the process of assigning tasks to processors.
Message
Most programming problems have several parallel solutions.
The best solution may differ from that suggested by existing sequential algorithms.
You should not parallelize a code (most times you end up paralyzing it!). The correct approach is to develop the parallel algorithm.
Part IV
Computer Architecture and Efficiency
Introduction
In order to produce "fast" code one needs to understand modern computer hardware.
Crucial Concept: Memory Hierarchy.
Advice: Keep simple, intelligible (unoptimized) code in comments for future reference.
Memory Hierarchy
Caches
The fastest memory is the registers. Operations act directly on their values. They can be accessed in one clock cycle.
Cache is inside the CPU, so it is limited and fixed in size.
Caches are broken up into cache lines, which are short blocks of memory that copy memory ranges from main memory.
Blocks of memory in cache are indexed by a table in hardware.
Memory accesses from the CPU can be cache hits or cache misses.
It is good programming practice to use memory sequentially to be able to reuse cache lines and avoid cache misses.
Virtual Memory
Virtual memory is the usage of storage devices to increase the amount of available memory.
Blocks of virtual memory are called pages.
When memory is accessed, a table is looked at to find its physical address. If that address is not in main memory, a page is read from storage and the table updated; this is called a page fault.
Generating many page faults is called thrashing.
It is good programming practice to use memory sequentially to be able to reuse pages and avoid page faults.
Designing for memory hierarchies
Introduction
Main principle: Locality
Temporal locality - executable code
Spatial locality - data
For good temporal locality code should avoid jumping around too much. Most times a jump or a branch to a far-away address occurs, cache lines need to be filled with new instructions.
For good spatial locality code should use memory sequentially.
Inline small functions
Consider the codes:

Direct version:
    double x[N], y[N], z[N];
    ...
    for( i=0 ; i<N ; ++i )
        z[i]=x[i]+y[i];

Function-call version:
    double d_add(double a, double b)
    { return a+b; }
    ...
    for( i=0 ; i<N ; ++i )
        z[i]=d_add(x[i],y[i]);
Function calls involve setting up, and later destroying, a stack frame.
Function calls usually involve branching to somewhere far away in memory.
Variables usually have to be saved from registers to memory and vice-versa.
If you really need code like this, consider inlining small functions, as in the sketch below.
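As a rough illustration (not from the original slides), in C the helper can be marked inline so the compiler may substitute its body at the call site and remove the call overhead:

    /* minimal sketch: an inlinable helper the compiler can expand in place */
    static inline double d_add(double a, double b) { return a + b; }

    void add_arrays(int n, const double *x, const double *y, double *z)
    {
        for (int i = 0; i < n; ++i)
            z[i] = d_add(x[i], y[i]);   /* typically compiled to z[i] = x[i] + y[i] */
    }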
if-then-else
The if-then-else construct tends to reduce temporal locality. Avoid it if possible in crucial sections of the code, like for loops, as it reduces locality and makes it much more difficult for compilers to optimize the code. A sketch of hoisting a branch out of a hot loop follows.
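A hedged sketch of the idea (names are illustrative): moving a loop-invariant test out of the loop keeps the hot loop body branch-free.

    void scale_or_copy(int n, int use_scale, double s, const double *x, double *y)
    {
        if (use_scale) {                              /* decide once, outside the loop */
            for (int i = 0; i < n; ++i) y[i] = s * x[i];
        } else {
            for (int i = 0; i < n; ++i) y[i] = x[i];
        }
    }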
Compact loops
Keep your loops as compact as possible to achieve better locality and improve the chance of compiler optimizations.
Loop index order
Always order your loops according to the row-wise or column-wise array ordering of your language.
C/C++ (row-wise ordering):
    for( i=0 ; i<m ; ++i )
        for( j=0 ; j<n ; ++j )
            y[i]=y[i]+a[i][j]*x[j];

Fortran (column-wise ordering):
    do j=1,n
        do i=1,m
            y(i)=y(i)+a(i,j)*x(j)
        enddo
    enddo
Blocking
In loops with multiple arrays it is often impossible to achieve good spatial locality just by loop index reordering. This is the case if both rows and columns are used simultaneously. A common practice is to operate on blocks of the matrices (sub-matrices). The goal is to maximize access to the data loaded into cache before it gets replaced. This technique is called blocking; a sketch follows.
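A minimal sketch of blocking in C (the block size B is assumed to fit in cache and, for brevity, to divide n):

    void matmul_blocked(int n, int B, const double *A, const double *Bmat, double *C)
    {
        for (int ii = 0; ii < n; ii += B)
            for (int jj = 0; jj < n; jj += B)
                for (int kk = 0; kk < n; kk += B)
                    /* multiply one B x B sub-matrix pair while it is resident in cache */
                    for (int i = ii; i < ii + B; ++i)
                        for (int j = jj; j < jj + B; ++j) {
                            double s = C[i*n + j];
                            for (int k = kk; k < kk + B; ++k)
                                s += A[i*n + k] * Bmat[k*n + j];
                            C[i*n + j] = s;
                        }
    }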
Block reordering
Suppose we want to multiply two large n x n matrices and we use a block algorithm:
\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
=
\begin{pmatrix} A_{11}B_{11}+A_{12}B_{21} & A_{11}B_{12}+A_{12}B_{22} \\ A_{21}B_{11}+A_{22}B_{21} & A_{21}B_{12}+A_{22}B_{22} \end{pmatrix}
\]
This algorithm can be applied recursively such that every block fits in the available cache. In this case, instead of storing the matrices in either row-wise or column-wise order, we can store them in a way that makes the algorithm particularly convenient: to store A we first store A11, then A12, followed by A21 and A22. Each sub-matrix should be recursively stored in the same fashion.
Block reordering
For an 8x8 matrix we would have the following storage order:
     1  2  5  6 17 18 21 22
     3  4  7  8 19 20 23 24
     9 10 13 14 25 26 29 30
    11 12 15 16 27 28 31 32
    33 34 37 38 49 50 53 54
    35 36 39 40 51 52 55 56
    41 42 45 46 57 58 61 62
    43 44 47 48 59 60 63 64
Such an ordering produces a cache-oblivious algorithm.
Block reordering
Cache-oblivious algorithms have, however, several issues that need to be taken into account:
Computing the address of a specific entry is much more complex than for standard matrix layouts.
The natural algorithms are recursive, which is difficult for compilers to optimize, and recursive routines cannot be inlined.
The natural base cases for the recursion are 1x1 or 2x2 block matrices, which are too small for modern pipelined CPUs.
Dynamic data structures
Dynamic data structures, like lists and trees, are commonly used in advanced algorithms.
Creation and destruction of nodes may destroy memory locality.
How can we create dynamic data structures and still keep memory locality?
Copy the data structure after creation to force sequential node creation before any critical calculation. Use allocation/deallocation routines that force memory locality (a sketch follows).
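As an illustration (a sketch, not the slides' code): allocating list nodes from one contiguous pool keeps neighbouring nodes close in memory.

    #include <stdlib.h>

    struct node { double value; struct node *next; };

    /* build an n-node list from a single contiguous allocation */
    struct node *make_pool_list(int n)
    {
        struct node *pool = malloc(n * sizeof *pool);
        if (!pool) return NULL;
        for (int i = 0; i < n; ++i) {
            pool[i].value = 0.0;
            pool[i].next  = (i + 1 < n) ? &pool[i + 1] : NULL;
        }
        return pool;   /* traversal now walks sequentially through memory */
    }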
Loop unrolling
Consider the codes:

Original version:
    sum=0.0;
    for(i=0;i<n;++i)
        sum=sum+a[i]*b[i];

Unrolled version:
    sum0=sum1=sum2=sum3=0.0;
    for(i=0;i<4*(n/4);i+=4)
    {
        sum0=sum0+a[i]*b[i];
        sum1=sum1+a[i+1]*b[i+1];
        sum2=sum2+a[i+2]*b[i+2];
        sum3=sum3+a[i+3]*b[i+3];
    }
    for(i=4*(n/4);i<n;++i)
        sum0=sum0+a[i]*b[i];
Most CPUs use pipelined units to perform these operations. In the first version each multiply-add has to clear the pipeline to be added to "sum". In the second version the pipeline is kept full and more multiply-adds are performed. This technique is called loop unrolling. How much loop unrolling is needed depends on the details of the architecture and the effectiveness of the compiler optimizations.
Library usage
In any algorithm performance hotspot NEVER, EVER use your own routines for numerical operations.
ALWAYS use libraries.
ALWAYS choose CPU optimized libraries if available.
Typical - linear algebra:
Manufacturer version
ATLAS or GOTO
operating system version
your own version
(A sketch of calling an optimized BLAS follows.)
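A sketch of what this looks like in practice, assuming an optimized BLAS with the CBLAS interface (for instance ATLAS or a vendor library) is installed: C = A*B via dgemm instead of a hand-written triple loop.

    #include <cblas.h>

    void multiply(int n, const double *A, const double *B, double *C)
    {
        /* C := 1.0*A*B + 0.0*C, all matrices n x n in row-major order */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A, n, B, n,
                    0.0, C, n);
    }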
Part V
Performance Analysis
Introduction
Being able to accurately predict the performance of a parallel algorithm you have designed can help you decide whether to actually go to the trouble of coding and debugging it.
Being able to analyze the execution time exhibited by a parallel program can help you understand the barriers to higher performance and predict how much improvement can be realized by increasing the number of processors.
Speedup and Efficiency
\[ \text{Speedup} = \frac{\text{Sequential execution time}}{\text{Parallel execution time}} = \psi(n,p) \]
\[ \text{Efficiency} = \frac{\text{Speedup}}{\#\text{ Processors}} = \varepsilon(n,p) \]
Speedup and Efficiency
Time in parallel algorithms is used in:
Sequential execution: σ(n)
Parallel execution: ϕ(n)
Parallel overhead: κ(n, p)
Speedup and efficiency can hence be written as:
\[ \psi(n,p) \le \frac{\sigma(n)+\varphi(n)}{\sigma(n)+\frac{\varphi(n)}{p}+\kappa(n,p)} \qquad
\varepsilon(n,p) \le \frac{\sigma(n)+\varphi(n)}{p\,\sigma(n)+\varphi(n)+p\,\kappa(n,p)} \]
Amdahl’s law
Consider the expression for the speedup ψ(n, p). Since κ(n, p) ≥ 0 we have:
\[ \psi(n,p) \le \frac{\sigma(n)+\varphi(n)}{\sigma(n)+\frac{\varphi(n)}{p}+\kappa(n,p)} \le \frac{\sigma(n)+\varphi(n)}{\sigma(n)+\frac{\varphi(n)}{p}} \]
Let f denote the sequential portion of the computation:
\[ f = \frac{\sigma(n)}{\sigma(n)+\varphi(n)} \]
Substituting we obtain:
\[ \psi(n,p) \le \frac{1}{f + \frac{1-f}{p}} \]
Amdahl’s law
Amdahl's Law: Let f be the fraction of operations in a computation that must be performed sequentially. The maximum speedup ψ achievable by a parallel computer with p processors is:
\[ \psi(n,p) \le \frac{1}{f + \frac{1-f}{p}} \]
Corollary: The maximum achievable speedup by a parallel algorithm with a fraction f of its code performed sequentially is:
\[ \lim_{p \to +\infty} \psi(n,p) \le \frac{1}{f} \]
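A small worked example (numbers chosen for illustration): with f = 0.1 and p = 16,
\[ \psi(n,16) \le \frac{1}{0.1 + \frac{0.9}{16}} = \frac{1}{0.15625} = 6.4, \]
and no matter how many processors are added, the speedup can never exceed 1/f = 10.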
Amdahl’s Effect
Typically κ(n, p) has lower complexity than ϕ(n). Increasing the size of the problem increases the computation time faster than it increases the parallel overhead. Hence, for a fixed number of processes, speedup is usually an increasing function of problem size. This is known as Amdahl's effect.
Gustafson-Barsis’ Law
Consider the expression for the speedup ψ(n, p). Since κ(n, p) ≥ 0 we have:
\[ \psi(n,p) \le \frac{\sigma(n)+\varphi(n)}{\sigma(n)+\frac{\varphi(n)}{p}+\kappa(n,p)} \le \frac{\sigma(n)+\varphi(n)}{\sigma(n)+\frac{\varphi(n)}{p}} \]
Let s denote the fraction of time spent in the parallel execution performing sequential operations:
\[ s = \frac{\sigma(n)}{\sigma(n)+\frac{\varphi(n)}{p}} \]
Substituting we obtain:
\[ \psi(n,p) \le p + (1-p)\,s \]
Gustafson-Barsis’ Law
Gustafson-Barsis' Law: Given a parallel program solving a problem of size n using p processors, let s denote the fraction of total execution time spent in serial code. The maximum speedup (usually called scaled speedup) ψ achievable by this program is:
\[ \psi(n,p) \le p + (1-p)\,s \]
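A small worked example (illustrative numbers): if a run on p = 64 processors spends s = 0.05 of its time in serial code, the scaled speedup is bounded by
\[ \psi(n,64) \le 64 + (1-64)(0.05) = 64 - 3.15 = 60.85. \]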
Summary on speedup limits
Amdahl's law determines speedup by taking the serial computation and predicting how quickly that computation could execute on multiple processors.
Gustafson-Barsis' Law begins with a parallel computation and estimates how much faster the parallel computation is than the same computation executing on a single processor.
Both laws ignore parallel overhead.
The Karp-Flatt metric
We define the experimentally determined serial fraction e of a parallel computation as:
\[ T(n,p) = T(n,1)\,e + \frac{T(n,1)}{p}\,(1-e) \]
The Karp-Flatt metric: Given a parallel computation exhibiting speedup ψ on p processors, the experimentally determined serial fraction e is:
\[ e = \frac{\frac{1}{\psi} - \frac{1}{p}}{1 - \frac{1}{p}} \]
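A small worked example (illustrative numbers): a speedup of ψ = 6.4 measured on p = 8 processors gives
\[ e = \frac{\frac{1}{6.4} - \frac{1}{8}}{1 - \frac{1}{8}} = \frac{0.15625 - 0.125}{0.875} \approx 0.036, \]
that is, about 3.6% of the execution behaves as serial work plus overhead.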
The Karp-Flatt metric
The Karp-Flatt metric is useful because it:
takes into account parallel overhead;
helps detect sources of overhead and inefficiency.
Part VI
OpenMP
Introduction
Definition
OpenMP is a parallel programming model for shared memory and distributed shared memory multiprocessors.
OpenMP is not a computer language. It can be used with FORTRAN and C/C++.
OpenMP requires a compliant compiler.
Advantages and Disadvantages of OpenMP
Advantages
Easier to learn.
Parallelization can be incremental.
Coarse-grain and fine-grain parallelization.
Widely available.
Portable.
Disadvantages
Limited scalability.
Impossible to use directly on distributed memory systems.
Components
[Figure: OpenMP components - User, Application, Compiler Directives, Environment Variables, Runtime Libraries, and Threads in the Operating System]
Components
Compiler Directives.
Run-time libraries.
Environment variables.
Execution Model: The fork/join model
Parallel regions are the building “blocks” within the code.
A master thread is started at runtime and persists throughout execution.
The master thread starts the team of threads at parallel regions.
[Figure: the master thread forks a team of threads at each parallel region and joins them at the end]
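A minimal fork/join sketch in C (assumes an OpenMP-aware compiler):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("serial region: master thread only\n");
        #pragma omp parallel                 /* fork: a team of threads is created */
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                    /* join: back to the master thread */
        printf("serial region again\n");
        return 0;
    }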
Thread Communication
Every thread has access to global memory - shared memory.
Every thread has access to stack memory - private memory.
Code should use shared memory to communicate between threads.
Simultaneous updates to shared memory can create a race condition: results change with different thread scheduling.
Use mutual exclusion to avoid data races, but don't use too much because it will serialize performance (see the reduction sketch below).
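As a sketch of how a race on a shared accumulator is usually avoided, a reduction gives each thread a private partial sum that OpenMP combines safely at the end:

    double dot(int n, const double *a, const double *b)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += a[i] * b[i];   /* without the reduction clause this update would race */
        return sum;
    }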
OpenMP
Syntax and Sentinels
Parallel Control Directives
OpenMP provides two kinds of directives to control parallelism:
To create multiple threads.
To divide work among an existing set of parallel threads (both kinds are sketched below).
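A minimal sketch showing the two kinds together:

    void saxpy(int n, double a, const double *x, double *y)
    {
        #pragma omp parallel          /* kind 1: creates the team of threads */
        {
            #pragma omp for           /* kind 2: divides the iterations among that team */
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }
    }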
Combined Directives
Some directives can, however, be combined together, as in the sketch below:
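For example, the parallel and for directives from the previous sketch collapse into a single combined directive:

    void saxpy_combined(int n, double a, const double *x, double *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }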
Data Environment
Only the master thread has a data address space that lasts for the entire duration of the run.
During the fork operation OpenMP chooses to either share a single copy between all the threads or provide each thread with its own private copy for the duration of the parallel construct.
There are defaults for choosing if a variable is shared or private.
The user is advised to always specify the character of the variable.
Data Environment
Default behavior can be overridden with data scoping clauses:
FORTRAN
!$OMP PARALLEL DEFAULT(NONE) &
!$OMP SHARED(...) PRIVATE(...) REDUCTION(...)
C/C++
#pragma omp parallel default(none) \
shared(...) private(...) reduction(...)
Initialization of private variables can also be controlled with firstprivate.
The return value of private variables can also be controlled with lastprivate.
Data Environment: Reductions
An "operation" that combines multiple elements to form a single result is called a reduction operation.
The variable that accumulates the result is called the reduction variable.
Reduction operators and variables must be declared.
sum=0
!$omp parallel do reduction(+ : sum)
do i = 1, n
sum = sum + a(i)
enddo
Directive Clauses
Clauses control the behavior of OpenMP directives.
Part VII
More OpenMP
Synchronisation
The term synchronization refers to the mechanism by which a parallel program can coordinate the execution of multiple threads. The two most common forms of synchronization are:
Mutual exclusion: a way to control access to a shared variable - the critical/atomic directives.
Event synchronization: a way to ensure that threads have all reached a particular point in execution - the barrier directive. (A small sketch follows.)
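A minimal sketch combining both forms (illustrative, not the slides' example):

    void sum_all(int n, const double *a, double *total)
    {
        *total = 0.0;
        #pragma omp parallel
        {
            double local = 0.0;
            #pragma omp for
            for (int i = 0; i < n; ++i)
                local += a[i];

            #pragma omp critical      /* mutual exclusion around the shared update */
            *total += local;

            #pragma omp barrier       /* event synchronization: every thread waits here */
            /* from here on every thread sees the completed *total */
        }
    }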
Runtime libraries
Environment variables
Compiling and Running an OpenMP programme
Compiling an OpenMP program
Compiling an OpenMP program depends on the compiler:
Intel Fortran: ifort -openmp input file
Pathscale Fortran: pathf90 -mp input file
GCC: gfortran -fopenmp input file (version > 4.2)
Running an OpenMP program
maolive@phorcys:~> pathf90 -mp hello3.f90 -o hello3
maolive@phorcys:~> export OMP_NUM_THREADS=5
maolive@phorcys:~> ./hello3
Hello world from thread 1 !
Hello world from thread 0 !
Hello world from thread 4 !
Hello world from thread 3 !
Hello world from thread 2 !
Part VIII
Introduction to MPI
MPI
Definition
MPI is a parallel programming model for message passing multiprocessing.
It consists of a set of library calls to enable multiple processes to communicate.
In MPI all instances of your parallel application are started at launch and have private address spaces.
MPI is not a computer language. It can be used with FORTRAN and C/C++.
Advantages and Disadvantages
Advantages
Scalability.
Compatible even with shared memory systems.
Widely available.
Portable.
Disadvantages
Harder to learn.
Non-incremental parallelization.
Fundamentals
MPI is based on the message passing paradigm with a minimal interface such as:
send(address, count, datatype, destination, tag, comm)
receive(address, maxcount, datatype, source, tag, comm)
Messages can be either point-to-point or collective and can be blocking or non-blocking.
Processes belong to a group. In each group a process is identified by its rank.
Each group of processes is characterized by an object called a communicator.
Basic MPI
Basic MPI Tasks
Initialization and Termination
Setting up Communicators
Point to Point Communication
Collective Communication
Basic Routines
MPI_Init: Initialize MPI
MPI_Comm_size: Find out how many processes there are
MPI_Comm_rank: Find out which process I am
MPI_Send: Send a message
MPI_Recv: Receive a message
MPI_Finalize: Terminate MPI
MPI_Bcast: Broadcast a message
MPI_Reduce: Reduce a message
MPI_Barrier: Blocks until all processes reach the barrier
MPI_Abort: Terminates MPI with an error code
Beyond Basic?
All other functions just add:
flexibility (datatypes),
robustness (non-blocking send/receive),
efficiency,
modularity (groups, communicators),
convenience (collective operations, topologies).
Example
Initialization and Termination
program param
include 'mpif.h'
call mpi_init(ierr)
call mpi_comm_size(MPI_COMM_WORLD,np,ierr)
call mpi_comm_rank(MPI_COMM_WORLD,mype,ierr)
...
do_something_useful
...
call mpi_finalize(ierr)
end program
Compiling and Running an MPI program
Compiling
Generally use a special compiler or compiler wrapper script:
not defined by the standard
consult your implementation
handles correct include path, library path, and libraries
MPICH-style (the most common):
C: mpicc -o foo foo.c
Fortran: mpif77 -o foo foo.f
Running
MPI programs require some help to get started
what computers should I run on?
how do I access them?
MPICH-style
mpirun -np 10 -machinefile mach ./foo
When batch schedulers are involved, all bets are off
Part IX
MPI
Collective Communication
Collective Communications
Involves all processes within a communicator
There are three basic types:
Synchronization
Data movement
Data movement with computation
MPI Collectives are blocking
MPI Collectives do not use message tags
Synchronization: Barrier
A barrier is a synchronization primitive. A node calling it will block until all the nodes within the group have called it.
MPI_BARRIER(COMM, IERR)
Data Movement: Broadcast
MPI provides the broadcast primitive to send data to all communicator members:
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERR)
It must be called by each node in the group with the same COMM and ROOT.
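A sketch in C (the value set on the root is hypothetical):

    #include <mpi.h>

    void share_parameter(double *value)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            *value = 3.14;                        /* e.g. read from an input file on the root */
        MPI_Bcast(value, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        /* after the call every rank holds the same value */
    }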
Data Movement: Scatter, Gather
Gather and scatter are inverse operations. Gather collects data from every member of the group on the root node, in linear order by the rank of the node. Scatter parcels out data from the root to every member of the group, in linear order by node.
MPI_GATHER(SNDBUF, SCOUNT, DATATYPE, RECVBUF, RCOUNT, RDATATYPE, ROOT, COMM, IERR)
MPI_SCATTER(SNDBUF, SCOUNT, DATATYPE, RECVBUF, RCOUNT, RDATATYPE, ROOT, COMM, IERR)
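A C sketch (assumes the array length is divisible by the number of ranks; the local work is illustrative):

    #include <mpi.h>

    void scatter_work_gather(int n, const double *input, double *output)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = n / size;
        double local[chunk];                      /* variable-length array for brevity */

        MPI_Scatter(input, chunk, MPI_DOUBLE,
                    local, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        for (int i = 0; i < chunk; ++i)
            local[i] *= 2.0;                      /* each rank works on its own chunk */
        MPI_Gather(local, chunk, MPI_DOUBLE,
                   output, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }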
Data Movement: All Gather
Provides a more efficient way to do a gather followed by a broadcast: all members of the group receive the collected data.
MPI_ALLGATHER(SNDBUF, SCOUNT, SDATATYPE, RECVBUF, RCOUNT, RDATATYPE, COMM, IERR)
Data Movement: All to All
The jth block sent from node i is received by node j and is placed in the ith block. This is typically useful for implementing the Fast Fourier Transform.
MPI_ALLTOALL(SNDBUF, SCOUNT, SDATATYPE, RECVBUF, RCOUNT, RDATATYPE, COMM, IERR)
Advanced Data Movement Primitives
MPI_{Scatter,Gather,Allgather,Alltoall}v: the v stands for varying size and relative location of messages.
Advantages
flexibility
less need to copy data into temporary buffers
more compact
Disadvantages
Harder to program - more bookkeeping
Generalized Data Movement Primitives
Scatter vs Scatterv
Data movement with computation: Reduce
A global reduction combines partial results from each node in the group using some basic function and returns the answer to the root node:
MPI_REDUCE(SNDBUF, RECVBUF, COUNT, DATATYPE, OPERATOR, ROOT, COMM, IERR)
Data movement with computation: AllReduce
A variant of the reduce operation where the result is returned to all processes in the group.
MPI_ALLREDUCE(SNDBUF, RECVBUF, COUNT, DATATYPE, OPERATOR, COMM, IERR)
Data movement with computation: Scan
A scan performs partial reductions on distributed data. Specifically, the scan operation returns the reduction of the values in the send buffers of processes ranked 0, 1, ..., n into the receive buffer of the node ranked n.
MPI_SCAN(SNDBUF, RECVBUF, COUNT, DATATYPE, OPERATOR, COMM, IERR)
Data movement with computation: The operations
MPI_SUM: sum
MPI_PROD: product
MPI_MAX: maximum value
MPI_MIN: minimum value
MPI_MAXLOC: max. value location
MPI_MINLOC: min. value location
MPI_LAND: logical and
MPI_LOR: logical or
MPI_LXOR: logical xor
MPI_BAND: bitwise and
MPI_BOR: bitwise or
MPI_BXOR: bitwise xor
Table: Common predefined operations for MPI collectives
Point to Point Communication
Deadlock
Deadlock is a phenomenon most common with blocking communication. It occurs when all tasks are waiting for events that haven't been initiated yet.
Communication Behavior
Point to point communication functions have two distinctbehaviors: blocking and non-blocking.
Blocking
Execution is suspended until the message buffer is safe to reuse.
Non-blocking
Execution is not suspended. The user is responsible for testing or waiting until the message buffer is safe to reuse.
Communication Modes
For each communication behavior there are four modes:
Standard
Implementation dependent. Should be the best compromise between reliability and performance.
Synchronous
The sending task sends the receiver a "ready to send" message. When the receiver replies with "ready to receive", data is transferred.
Ready
When called, a "ready to receive" message has to have arrived, otherwise an error is generated.
Buffered
Copies the message to a user-supplied buffer and returns.
Unidirectional Point to point Communication
Mode: Blocking / Non-blocking
Standard: MPI_Send / MPI_Isend
Synchronous: MPI_Ssend / MPI_Issend
Ready: MPI_Rsend / MPI_Irsend
Buffered: MPI_Bsend / MPI_Ibsend
Receive (all modes): MPI_Recv / MPI_Irecv
Blocking Standard Mode
Blocking Synchronous Mode
Blocking Ready Mode
Blocking Buffered Mode
Non-Blocking Standard Mode
Point to point Communications
Ready mode has the least total overhead. However, the assumption is that the receive is already posted.
Synchronous mode is portable and "safe". It does not depend on order (ready) or buffer space (buffered). However, it incurs substantial overhead.
Buffered mode decouples the sender from the receiver. There is no synchronization overhead on the sending task and the order of execution does not matter (ready). The user can control the size of the message buffers and the total amount of space. However, additional overhead may be incurred by the copy to the buffer, and buffer space can run out.
Standard mode is implementation dependent. Small messages are generally buffered (avoiding synchronization overhead) and large messages are usually sent synchronously (avoiding the required buffer space).
To block or not to block?!
This code worked on one machine but does not work in general. Why?

    ! SEND DATA
    LM=6*NES+2
    DO I=1,NUMPRC
        NT=I-1
        IF (NT.NE.MYPRC) THEN
            print *,myprc,'send',msgtag,'to',nt
            CALL MPI_SEND(NWS,LM,MPI_INTEGER,NT,MSGTAG, &
                          MPI_COMM_WORLD,IERR)
        ENDIF
    END DO
    ! RECEIVE DATA
    LM=6*100+2
    DO I=2,NUMPRC
        CALL MPI_RECV(NWS,LM,MPI_INTEGER, &
                      MPI_ANY_SOURCE,MSGTAG,MPI_COMM_WORLD,STATUS,IERR)
        ! do something with data
    END DO
Bidirectional point to point communications
The MPI_SENDRECV function provides an efficient way to handle a common situation where a processor needs to send data to another processor and receive data from the same processor, or a different one. Calling MPI_SENDRECV is similar to calling MPI_SEND followed by a call to MPI_RECV.
MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status)
MPI_SENDRECV_REPLACE is a version of the previous primitive that allows the send buffer and receive buffer to be the same, which in effect replaces the contents of the send buffer by that of the received buffer.
MPI_SENDRECV_REPLACE(buf, count, datatype, dest, sendtag, source, recvtag, comm, status, ierr)
Avoiding Deadlock
different ordering of calls
non-blocking calls
bidirectional calls
buffered mode (a non-blocking sketch follows)
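A sketch of the non-blocking approach for a ring exchange, where blocking sends could otherwise deadlock:

    #include <mpi.h>

    void ring_exchange(const double *sendbuf, double *recvbuf, int count)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        MPI_Request reqs[2];
        MPI_Irecv(recvbuf, count, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, count, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }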
Part X
More MPI
Derived Datatypes
Basic Types
MPI Datatype: Fortran Datatype
MPI_BYTE: (no corresponding Fortran datatype)
MPI_CHARACTER: CHARACTER
MPI_COMPLEX: COMPLEX
MPI_DOUBLE_PRECISION: DOUBLE PRECISION
MPI_INTEGER: INTEGER
MPI_LOGICAL: LOGICAL
MPI_PACKED: (no corresponding Fortran datatype)
MPI_REAL: REAL
Table: Basic predefined datatypes for Fortran
Derived Datatypes
MPI basic datatypes are predefined for contiguous data of a single type.
What if the application has data of mixed types, or non-contiguous data?
Existing solutions (multiple calls, or copying into a buffer and packing) are slow, clumsy and waste memory; a better solution is creating/deriving datatypes for these special needs from existing datatypes.
Derived datatypes can be created recursively and at run-time.
Automatic packing and unpacking.
Derived Datatypes
Elementary: Language-defined types
Contiguous: Vector with stride of one
Vector: Separated by constant “stride”
Hvector: Vector, with stride in bytes
Indexed: Array of indices (for scatter/gather)
Hindexed: Indexed, with indices in bytes
Struct: General mixed types (for C structs etc.)
Derived Datatypes
! contiguous datatype: count consecutive elements of "datatype"
call MPI_Type_contiguous(count, datatype, newdatatype, ierr)
call MPI_Type_commit(newdatatype, ierr)
...
call MPI_Type_free(newdatatype, ierr)

! vector datatype: one row of an nrows x ncols array (stride nrows)
call MPI_Type_vector(ncols, 1, nrows, MPI_DOUBLE_PRECISION, vtype, ierr)
call MPI_Type_commit(vtype, ierr)
...
call MPI_Send( A(nrows,1), 1, vtype ...)
...
call MPI_Type_free(vtype, ierr)
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Communicators and Groups
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Communicators and Groups
All MPI communication is relative to a communicator, which contains a context and a group. The group is just a set of processes.
[Figure: the processes of MPI_COMM_WORLD partitioned into two sub-communicators, COMM_1 and COMM_2, each with its own local ranks.]
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Communicators and Groups
To handle communicators and groups we can usually deploy two strategies:
Group manipulation - More general.
Direct communicator manipulation - Usually more compact and suitable for regular decompositions.
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Communicators and Groups
First strategy
MPI_Init(&argc, &argv);
world = MPI_COMM_WORLD;
MPI_Comm_size(world, &numprocs);
MPI_Comm_rank(world, &myid);
server = numprocs - 1;                 /* last rank acts as the server */
MPI_Comm_group(world, &world_group);   /* group of all processes       */
ranks[0] = server;
/* new group = world group minus the server rank */
MPI_Group_excl(world_group, 1, ranks, &worker_group);
/* communicator containing only the workers */
MPI_Comm_create(world, worker_group, &workers);
MPI_Group_free(&worker_group);
MPI_Group_free(&world_group);
...
MPI_Comm_free(&workers);
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Communicators and Groups
Second strategy
/* color is 1 on the server, 0 on every worker: MPI_Comm_split puts
   processes with the same color into the same new communicator */
color = (myid == server);
MPI_Comm_split(world, color, 0, &workers);
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Topologies
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Topologies
Another way to use communicators is to organize tasks on a given topology.
Cartesian topologies are predefined and allow MPI tasks to be laid out on a Cartesian coordinate grid.
[Figure: a 4 x 3 Cartesian grid of MPI tasks labelled by their coordinates, from (0,0) to (3,2).]
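A minimal C sketch of creating such a grid, assuming the 4 x 3 layout of the figure (variable names are illustrative; run with 12 MPI tasks):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm grid_comm;
    int dims[2]    = {4, 3};   /* 4 x 3 process grid        */
    int periods[2] = {0, 0};   /* non-periodic in both dims */
    int coords[2], rank;

    MPI_Init(&argc, &argv);

    /* Attach Cartesian topology information to a new communicator;
       reorder = 1 lets MPI renumber ranks to match the hardware. */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid_comm);

    /* Each task can then ask for its own grid coordinates. */
    MPI_Comm_rank(grid_comm, &rank);
    MPI_Cart_coords(grid_comm, rank, 2, coords);
    printf("rank %d is at (%d,%d)\n", rank, coords[0], coords[1]);

    MPI_Comm_free(&grid_comm);
    MPI_Finalize();
    return 0;
}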
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Wildcards
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Wildcards
Enables the programmer to avoid having to specify a tag and/or a source:
MPI_ANY_TAG, MPI_ANY_SOURCE
Enables the programmer to keep algorithms general: MPI_PROC_NULL
can be used in send and/or receive; the operation completes immediately; no communication is involved
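A hedged sketch of how MPI_PROC_NULL keeps a halo exchange general; the 1-D decomposition and names are assumptions:

#include <mpi.h>

/* Exchange boundary values with the ranks above and below in a 1-D
   decomposition.  The first and last ranks have no neighbour on one
   side; using MPI_PROC_NULL there turns those transfers into no-ops,
   so the same code works on every rank. */
void halo_exchange(double boundary_lo, double boundary_hi,
                   double *ghost_lo, double *ghost_hi,
                   int rank, int size)
{
    int below = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int above = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my low boundary down, receive my high ghost from above */
    MPI_Sendrecv(&boundary_lo, 1, MPI_DOUBLE, below, 0,
                 ghost_hi,     1, MPI_DOUBLE, above, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send my high boundary up, receive my low ghost from below */
    MPI_Sendrecv(&boundary_hi, 1, MPI_DOUBLE, above, 1,
                 ghost_lo,     1, MPI_DOUBLE, below, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}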
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Timing
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Timing
...
call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! synchronize before starting the clock
t = MPI_Wtime()
...                                      ! timed section
call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! make sure every rank has finished
t = MPI_Wtime() - t                      ! elapsed wall-clock time in seconds
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Introduction
I/O types inMPI programs
Parallel MPII/O to a singlefile
MPI I/O
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Introduction
I/O types inMPI programs
Parallel MPII/O to a singlefile
Introduction
Programming languages have predefined functions to handle files.
Typical operations are open, close, read and write.
MPI supports counterparts.
These MPI routines can conveniently express parallelism.
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Introduction
I/O types inMPI programs
Parallel MPII/O to a singlefile
I/O types in MPI programs
Non-parallel I/O
I/O to separate files
Parallel I/O
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Introduction
I/O types inMPI programs
Parallel MPII/O to a singlefile
Parallel MPI I/O to a single file
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100
int main(int argc, char *argv[])
{
int i, myrank, buf[BUFSIZE];
MPI_File thefile;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
for (i=0; i<BUFSIZE; i++)
buf[i] = myrank * BUFSIZE + i;
MPI_File_open(MPI_COMM_WORLD, "testfile",
MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &thefile);
MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&thefile);
MPI_Finalize();
return 0;
}
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Part XI
Hybrid Programming
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Hybrid Programming
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Distributed and Shared Memory Systems
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Distributed and Shared Memory Systems
Combines distributed memory parallelization with on-node shared memory parallelization.
The largest systems now employ both architectures.
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Why Hybrid Computing
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Why Hybrid Computing
Eliminates domain decomposition at node
Automatic coherency at node
Lower memory latency and data movement within node
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Modes of Hybrid Computing
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Modes of Hybrid Computing - Tasks and Threads
Fixing the execution to a particular processing unit and memory bank can speed up execution. Consider using "numactl"...
Running hybrid codes on a queuing system requires special configuration or skillful scripting.
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Hybrid Coding
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Modes of Hybrid Computing - Types
Support Level           Description
MPI_THREAD_SINGLE       Only one thread will execute.
MPI_THREAD_FUNNELED     Process may be multithreaded, but only the main thread can make MPI calls. Default mode.
MPI_THREAD_SERIALIZED   Process may be multithreaded, and any thread can make MPI calls, but threads cannot execute MPI calls concurrently.
MPI_THREAD_MULTIPLE     Multiple threads may call MPI. No restrictions.
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
MPI Initialization
Fortran: call MPI_Init_thread(required, provided, ierr)
C: int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
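A minimal C sketch of requesting and checking a thread support level (error handling simplified; not from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;

    /* ask for the funneled level: only the main thread will call MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not support MPI_THREAD_FUNNELED\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... hybrid MPI + OpenMP work goes here ... */

    MPI_Finalize();
    return 0;
}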
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Funneled Mode
MPI_THREAD_FUNNELED
Use OMP BARRIER since there is no implicit barrier in the master construct (OMP MASTER).
All other threads will be sleeping.
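A hedged sketch of the funneled pattern (the exchanged buffers and neighbours are placeholders):

#include <mpi.h>
#include <omp.h>

/* Funneled mode: all threads compute, but only the master thread makes
   the MPI call (MPI_THREAD_FUNNELED is sufficient). */
void compute_and_exchange(double *sendbuf, double *recvbuf, int n,
                          int left, int right)
{
    #pragma omp parallel
    {
        /* ... threaded computation filling sendbuf ... */

        #pragma omp barrier    /* all threads must be done before we send */
        #pragma omp master
        {
            MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, right, 0,
                         recvbuf, n, MPI_DOUBLE, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        #pragma omp barrier    /* OMP MASTER has no implicit barrier */

        /* ... threaded computation using recvbuf ... */
    }
}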
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Funneled Mode
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Serialized Mode
MPI_THREAD_SERIALIZED
Only OMP BARRIER at the beginning, since there is an implicit barrier in the SINGLE work-share construct (OMP SINGLE).
All other threads will be sleeping.
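A hedged sketch of the serialized pattern (placeholder names; any one thread may end up in the SINGLE region, so MPI_THREAD_SERIALIZED is needed):

#include <mpi.h>
#include <omp.h>

/* Serialized mode: any one thread may make the MPI call, but never two
   at the same time.  OMP SINGLE picks one thread, and its implicit
   barrier at the end keeps the others waiting. */
void compute_and_exchange(double *sendbuf, double *recvbuf, int n,
                          int left, int right)
{
    #pragma omp parallel
    {
        /* ... threaded computation filling sendbuf ... */

        #pragma omp barrier    /* only an explicit barrier is needed here */
        #pragma omp single
        {
            MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, right, 0,
                         recvbuf, n, MPI_DOUBLE, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }   /* implicit barrier at the end of OMP SINGLE */

        /* ... threaded computation using recvbuf ... */
    }
}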
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Serialized Mode
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Multi-Threaded Mode
MPI_THREAD_MULTIPLE
No restrictions.
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Multi-threaded Mode
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Overlapping Communication and Work
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Overlapping Communication and Work
One core can saturate the PCI to network bus. Why use all of them to communicate?
Communicate with one or several cores.
Work with others during communication.
Need at least MPI_THREAD_FUNNELED support.
Can be difficult to manage and load balance!
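A hedged sketch of the overlap idea, assuming a single communication partner (names are placeholders):

#include <mpi.h>
#include <omp.h>

/* Needs at least MPI_THREAD_FUNNELED: thread 0 (the master) drives the
   communication while the other threads compute the part of the domain
   that does not depend on the incoming halo data. */
void overlap_step(double *halo_send, double *halo_recv, int n, int partner)
{
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            MPI_Sendrecv(halo_send, n, MPI_DOUBLE, partner, 0,
                         halo_recv, n, MPI_DOUBLE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            /* ... compute interior points that need no remote data ... */
        }
        #pragma omp barrier   /* both the work and the message are done */

        /* ... all threads now update the boundary using halo_recv ... */
    }
}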
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Overlapping Communication and Work
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Thread-rank Communication
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Thread-rank Communication
Can use thread and rank id in communication.
The usual technique in multi-threaded code is to use tags to distinguish threads.
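A hedged sketch of this tagging convention, assuming MPI_THREAD_MULTIPLE and per-thread blocks in the buffers (names are placeholders):

#include <mpi.h>
#include <omp.h>

/* Each thread talks to the thread with the same id on a partner rank.
   The thread id is used as the message tag, so messages from different
   threads cannot be confused.  sendbuf and recvbuf are assumed to hold
   one block of n doubles per thread. */
void thread_to_thread(double *sendbuf, double *recvbuf, int n, int partner)
{
    #pragma omp parallel
    {
        int tag = omp_get_thread_num();   /* one tag per thread */

        MPI_Sendrecv(sendbuf + tag * n, n, MPI_DOUBLE, partner, tag,
                     recvbuf + tag * n, n, MPI_DOUBLE, partner, tag,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}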
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Thread-rank Communication
AdvancedComputing
M.A. Oliveira
Development
Part XII
Introduction to Software Development Tools
AdvancedComputing
M.A. Oliveira
Development
Introduction to Software Development Tools
AdvancedComputing
M.A. Oliveira
Development
Introduction to Software Development Tools
Software Development Tools
[Figure: "Code Development" flowchart - questions such as "Does my code run?", "What is the compiler doing?", "Slow?", "Is it optimal?", "Is it really optimal?" and "Am I being naive?" lead to the tools: debugger, compiler reports & listings, timers, compiler optimizations, libraries, profilers and hardware performance counters, ending in success.]
AdvancedComputing
M.A. Oliveira
Libraries
Part XIII
Libraries
AdvancedComputing
M.A. Oliveira
Libraries
Why ParallelLibraries?
PerformanceLibraries
OptimizedLibraries
Common HPCLibraries
Libraries
AdvancedComputing
M.A. Oliveira
Libraries
Why ParallelLibraries?
PerformanceLibraries
OptimizedLibraries
Common HPCLibraries
Why Parallel Libraries?
Like most programming tasks, very little “real” software is created by starting from a blank slate and coding every line of every algorithm.
Large scale parallel software construction involves significant code reuse, making use of libraries that encapsulate much of what we learned.
AdvancedComputing
M.A. Oliveira
Libraries
Why ParallelLibraries?
PerformanceLibraries
OptimizedLibraries
Common HPCLibraries
Performance Libraries
Optimized for specific architectures.
Use library routines instead of hand-coding your own.
Offered by different vendors:
ESSL/PESSL on IBM systems.
Intel MKL for IA32, EM64T and IA64.
Cray libsci for Cray systems.
SCSL for SGI.
ACML for AMD.
AdvancedComputing
M.A. Oliveira
Libraries
Why ParallelLibraries?
PerformanceLibraries
OptimizedLibraries
Common HPCLibraries
Optimized Libraries
Use optimized libraries:
In “hot spots”, never write library functions by hand.
Numerical Recipes books DO NOT provide optimized code.
(Libraries can be 100x faster.)
AdvancedComputing
M.A. Oliveira
Libraries
Why ParallelLibraries?
PerformanceLibraries
OptimizedLibraries
Common HPCLibraries
Common HPC Libraries
SPRNG - Parallel Random Numbers
FFTW - Parallel FFT (MPI, OpenMP)
ScaLAPACK - Parallel Linear Algebra (MPI)
GOTO, ATLAS, Intel Math Kernel Library (MKL) - Parallel Linear Algebra (OpenMP)
PETSc - Parallel PDEs and related problems (MPI)
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Part XIV
Compiler Optimization
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Compiler Optimization
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Compiler Optimizations
The compiler now does a very good job of optimizing code so you don't have to.
But program developers should ensure that their codes are adaptable to hardware evolution and are scalable.
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Optimization Level: -On
-O0 no optimization: fast compilation, disables optimization.
-O1 optimization for speed, keeps code size small.
-O2 low to moderate optimization: partial debugging support, disables inlining.
-O3 aggressive optimization: compile time/space intensive and/or marginal effectiveness; may change code semantics and results (sometimes even breaks codes!)
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Optimization Levels
Operations performed at moderate optimization levels:
instruction rescheduling
copy propagation
software pipelining
common subexpression elimination
prefetching, loop transformations
Operations performed at aggressive optimization levels:
enables -O3
more aggressive prefetching, loop transformations
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Intel Compiler Options
Processor-specific optimization options
-xT generates specialized code for EM64T, includes SSE4
Other optimization options:
-mp maintain floating point precision (disables some optimizations).
-mp1 improve floating-point precision (speed impact is less than -mp).
-ip enable single-file interprocedural (IP) optimizations (within files).
-ipo enable multi-file IP optimizations (between files).
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Intel Compiler Options
Other options:
-g debugging information, generates symbol table.
-strict ansi strict ANSI compliance.
-C enable extensive runtime error checking.
-convert kwd specify file format keyword: big_endian, cray, ibm, little_endian, native, vaxd
-openmp enable the parallelizer to generate multi-threaded code based on the OpenMP directives.
-static create a static executable for serial applications.
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Intel Compiler - Best Practice
Normal compiling: ifort -O3 -xT test.f90
Try compiling at -O3 -xT.
If the code breaks or gives wrong answers with -O3 -xT, first try -mp (maintain precision).
-O2 is the default optimization; compile with -O0 if this breaks (very rare).
-xT can include optimizations and may break some codes.
Don't include debug options for a production compile!
ifort -O2 -g test.f90
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
PGI Compilers
-O3 performs some compile-time and memory intensive optimizations in addition to those executed with -O2, but may not improve performance for all programs.
-Mipa=fast interprocedural optimizations. There is a loader problem with this option.
-tp barcelona-64 includes specialized code for the Barcelona chip.
-fast
-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
-mp enable the parallelizer to generate multi-threaded code based on the OpenMP directives.
-Minfo=mp,ipa information about OpenMP and interprocedural optimization.
-help lists options.
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Tuning Parameters
Different for each compiler.
Differences even between compiler versions.
Make sure you test and use them; you may be missing out on a well-deserved performance boost. But proceed with caution.
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Part XV
Debugging and Profiling
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
StandardDebuggers
DebuggingBasics
CommercialDebuggers
AnalysesTools
Debugging and Profiling
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
StandardDebuggers
DebuggingBasics
CommercialDebuggers
AnalysesTools
Standard Debuggers
The standard command line debugging tool in Linux is gdb. You can use this debugger for programs written in C, C++ and Fortran.
There are graphical frontends for gdb.
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
StandardDebuggers
DebuggingBasics
CommercialDebuggers
AnalysesTools
Debugging Basics
For effective debugging a couple of commands need to be mastered:
set breakpoints: regular and conditional
display the value of variables
set new values
step through a program
Less used commands can be learned as they become necessary.
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
StandardDebuggers
DebuggingBasics
CommercialDebuggers
AnalysesTools
gdb Basics
Common commands for gdb:
run - starts the program; if you do not set up any breakpoints the program will run until it terminates or core dumps.
print - prints a variable located in the current scope.
next - executes the current command, and moves to thenext command in the program.
step - steps through the next command.
break - sets a break point.
continue - used to continue till the next breakpoint or termination.
Note: if you are at a function call and you issue next, then the function will execute and return. However, if you issue step, then you will go to the first line of that function.
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
StandardDebuggers
DebuggingBasics
CommercialDebuggers
AnalysesTools
gdb Basics
More commands for gdb:
list - show code listing near the current execution location
delete - delete a breakpoint
condition - make a breakpoint conditional
display - continuously display value
undisplay - remove displayed value
where - show current function stack trace
help - display help text
quit - exit gdb
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
StandardDebuggers
DebuggingBasics
CommercialDebuggers
AnalysesTools
Commercial Debuggers: DDT & Totalview
Interactive, parallel, symbolic debuggers with GUI interface.
Works with C, C++ and Fortran Compilers
Available for many different platforms.
Supports OpenMP & MPI (and hybrid paradigm)
Supports 32- and 64-bit architectures
Simple to use (intuitive)
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Analyses Tools
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Performance Analysis Goals
Identify hotspot candidates for further study and optimization potential
Test optimization changes to verify usefulness
Floating-point improvements aimed at reducing overall wall-clock run time (but may potentially reduce scalability)
MPI improvements aimed at reducing MPI idle time and improving scalability
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Compiler Reports & Listings
Compilers will optionally generate optimization reports & listing files.
Use the loader map to determine what libraries you have loaded.
To activate them:
-Minfo=time,loop,inline,sym... (PGI)
-opt-report (optimization, Intel)
-S (listing, Intel)
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Timers
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Profilers
Profilers determine:
Call Tree
Wall clock/CPU time spent on each function
HW counters (e.g., cache misses, FLOPs)
A common Unix profiling tool is gprof.
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Profilers
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Hardware Performance Counters
Information obtained with profilers is usually extremely helpful but may not be enough to completely optimize a code.
Hardware performance counters usually require instrumenting the code but give a much more detailed view.
There are several solutions:
PAPI, TAU, PDT
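As a hedged illustration of counter instrumentation with PAPI's low-level API (a minimal sketch; the measured kernel is a placeholder and error checking is omitted):

#include <papi.h>
#include <stdio.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[2];

    /* initialise the library and build an event set */
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles         */
    PAPI_add_event(eventset, PAPI_L1_DCM);    /* L1 data cache misses */

    PAPI_start(eventset);
    /* ... kernel to be measured goes here ... */
    PAPI_stop(eventset, counts);

    printf("cycles = %lld, L1 data misses = %lld\n", counts[0], counts[1]);
    return 0;
}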
AdvancedComputing
M.A. Oliveira
HPCAvailableResources
Part XVI
HPC Available Resources
AdvancedComputing
M.A. Oliveira
HPCAvailableResources
HPC AvailableResources:Portugal
HPC AvailableResources:Europe
HPC Available Resources
AdvancedComputing
M.A. Oliveira
HPCAvailableResources
HPC AvailableResources:Portugal
HPC AvailableResources:Europe
HPC Available Resources: Portugal
MILIPEIA - http://www.lca.uc.pt
INGRID - http://www.gridcomputing.pt
RNCA - http://www.rnca.org.pt
AdvancedComputing
M.A. Oliveira
HPCAvailableResources
HPC AvailableResources:Portugal
HPC AvailableResources:Europe
HPC Available Resources: Europe
HPCEuropa - http://www.hpc-europa.org
PRACE - http://www.prace-project.eu