Topics in Advanced Computing
Miguel Afonso Oliveira
Laboratório de Instrumentação e Física Experimental de Partículas
LIP
Instituto de Telecomunicações - Polo de Coimbra
Outline
Part I: Computing
Part II: Platforms
Part III: Design
Part IV: Efficiency
Part V: Performance
Part VI: OpenMP
Part VII: More OpenMP
Part VIII: Intro to MPI
Part IX: MPI
Part X: More MPI
Part XI: Hybrid
Part XII: Development
Part XIII: Libraries
Part XIV: Compilers
Part XV: Debug & Prof
Part XVI: Resources
Part I: Introduction to Parallel Computing
1 Science and Computers: Classical Scientific Method; Modern Scientific Method; The Grand Challenges in Science
2 Motivating Parallelism: Computational Power Argument; Memory/Disk Speed Argument; The Data Communication Argument
3 Parallel Computing: What is Parallel Computing? Why do Parallel Computing?
4 Message
Part II: Parallel Programming Platforms
5 Parallel Computers: Flynn's Taxonomy: Early days; Flynn's Taxonomy: Recently; Memory model Taxonomy
6 Parallel Programming: The Two Extreme Models; Parallel Programming: The Real World
Part III: Principles of Parallel Algorithm Design
7 The Task/Channel Model
8 Foster’s Design Methodology
9 Message
Part IV: Computer Architecture and Efficiency
10 Introduction
11 Memory Hierarchy: Introduction; Caches; Virtual Memory
12 Designing for memory hierarchies: Introduction; Temporal locality (Inline small functions, if-then-else, compact loops); Spatial locality (Loop index order, Blocking, Block reordering, Dynamic data structures)
13 Loop unrolling
14 Library usage
Part V: Performance Analysis
15 Introduction
16 Speedup and Efficiency
17 Amdahl’s law
18 Gustafson-Barsis’ Law
19 Summary on speedup limits
20 The Karp-Flatt metric
Part VI: OpenMP
21 Introduction: Definition; Advantages and Disadvantages of OpenMP; Architecture; Components; Execution Model; Thread Communication
22 OpenMP: Syntax and Sentinels; Parallel Control Directives; Combined Directives; Data Environment; Directive Clauses
Part VII: More OpenMP
23 Synchronisation
24 Runtime libraries
25 Environment variables
26 Compiling and Running an OpenMP programme: Compiling; Running
Part VIII: Introduction to MPI
27 MPI
28 Definition
29 Advantages and Disadvantages
30 Fundamentals
31 Basic MPI
32 Example
33 Compiling and Running an MPI program: Compiling; Running
Part IX: MPI
34 Collective Communication: Synchronization; Data Movement; Advanced Data Movement Primitives; Data movement with computation
35 Point to Point Communication: Deadlock; Unidirectional point to point communications; Bidirectional point to point communications; Avoid Deadlock
Part X: More MPI
36 Derived Datatypes: Basic Types; Derived Datatypes
37 Communicators and Groups
38 Topologies
39 Wildcards
40 Timing
41 MPI I/O: Introduction; I/O types in MPI programs; Parallel MPI I/O to a single file
Part XI: Hybrid programming
42 Hybrid Programming
43 Distributed and Shared Memory Systems
44 Why Hybrid Computing
45 Modes of Hybrid Computing: Tasks and Threads; Hybrid Coding; Types; MPI Initialization; Funneled Mode; Serialized Mode; Multi-Threaded Mode
46 Overlapping Communication and Work
47 Thread-rank Communication
Part XII: Introduction to Software Development Tools
48 Introduction to Software Development Tools
Part XIII: Libraries
49 Libraries: Why Parallel Libraries?; Performance Libraries; Optimized Libraries; Common HPC Libraries
Part XIV: Compiler Optimization
50 Compiler Optimization: Introduction; Optimization Levels; Intel Compiler Options; PGI Compiler Options; Conclusions
Part XV: Debugging and Profiling
51 Debugging and Profiling: Standard Debuggers; Debugging Basics; Commercial Debuggers
52 Analysis Tools: Goals; Compiler Reports & Listings; Timers; Profilers; Hardware Performance Counters
Part XVI: HPC Available Resources
53 HPC Available Resources: Portugal; Europe
Part I
Introduction to Parallel Computing
Science and Computers
Classical Science
[Figure: the classical scientific method - Nature, Observation, Theory, Experimentation]
Modern Science
[Figure: the modern scientific method - Nature, Observation, Theory, Experimentation, Numerical Simulation]
The Grand Challenges in Science
Quantum Chemistry, Statistical Mechanics and Relativistic Physics
Cosmology and Astrophysics
Computational Fluid Dynamics and Turbulence
Material Design and Superconductivity
Biology, Pharmacology, Genome Sequencing, Genetic Engineering, Protein Folding, Enzyme Activity and Cell Modeling
Medicine and Modeling of Human Organs and Bones
Global Weather and Environmental Modeling
Motivating Parallelism
Computational Power Argument
Moore's Law: Circuit complexity doubles every eighteen months.
Generalized Moore's Law: The amount of computing power available at a given cost doubles every eighteen months.
Critical Issue:
How do we translate transistors into useful operations per second? How do we use the transistors to achieve increasing rates of computation?
Logical Answer:
Rely on parallelism (implicit and explicit).
Memory/Disk Speed Argument
The gap between processor speed and memory/disk speed presents a tremendous performance bottleneck.
The usual compromise is to use a hierarchy of caches...
Parallelism can also help because it yields better memory/disk system management:
larger aggregate caches;
higher aggregate bandwidth.
The Data Communication Argument
Networking infrastructure has evolved tremendously.
Why not use it as a heterogeneous parallel/distributed computing environment?
Voluntary Computing
SETI@HOME
IBERCIVIS...
Grid Computing
WLCG...
Parallel Computing
What is Parallel Computing?
Parallel Computing: the use of multiple processing units or computers for a common task.
Each processing unit works on its section of the problem.
Processing units can exchange information.
[Figure: four processing units, PU_1 to PU_4, each working on its own area of the problem]
Why do Parallel Computing?
To compute beyond the limits of single PU systems:
achieve more performance;
utilize more memory.
To be able to:
solve problems that can't be solved in a reasonable time with single PU systems;
solve problems that don't fit on a single PU system or even a single system.
So we can:
solve larger problems;
solve problems faster;
solve more problems.
Message
If you can't compute you can't compete!
Part II
Parallel Programming Platforms
Parallel Computers
Early days: Flynn’s Taxonomy
Recently: Flynn’s Taxonomy
[Figure: MIMD divided into SPMD (Single Program Multiple Data) and MPMD (Multiple Program Multiple Data)]
Memory Model Taxonomy
[Figure: parallel computers split into Shared Memory (CPU_0 ... CPU_N sharing one Memory through an Interconnect) and Distributed Memory (CPU_0 ... CPU_N, each with its own Mem_0 ... Mem_N, connected by an Interconnect)]
Parallel Programming
The Two Extreme Parallel Programming Models
[Figure: shared memory computers map to Shared Memory Programming (OpenMP); distributed memory computers map to Message-Passing Programming (MPI)]
The distributed model can be used directly on a shared memory system.
Using the shared memory model on a distributed memory system is only possible indirectly.
Both models can be combined to optimize performance.
Parallel Computers and Parallel Programming: The Real World
[Figure: the real-world landscape - hardware ranging from homogeneous systems (shared memory, Myrinet/Infiniband clusters, RDMA) to heterogeneous systems (CPU+GPU, CPU+Coprocessor, CPU+FPGA); memory models including shared memory, distributed memory and distributed/partitioned global address spaces (DGAS/PGAS); and software layers from operating-system threads to libraries and languages such as OpenMP, MPI, PVM, CAF, HPF, UPC, Chapel, IBM X10, SUN Fortress, CUDA and OpenCL]
Part III
Principles of Parallel Algorithm Design
The Task/Channel Model
Parallel Computation: a set of tasks interacting with each other by sending messages through channels.
Task: a program, its local memory and a collection of I/O ports.
Communication: tasks can send local data values to other tasks via output ports and receive data values from other tasks via input ports.
Channel: a message queue connecting one task's output port to another task's input port.
Synchronicity: message reception is synchronous (the receiver must wait for the message to arrive - blocking). Sending is asynchronous (the sender never blocks).
Foster’s Design Methodology
A four-step process for designing parallel algorithms:
Partitioning: the process of dividing the computation and the data into primitive tasks. Can be done in a data-centric approach (domain decomposition) or a computation-centric approach (functional decomposition).
Communication: determining the communication pattern between the primitive tasks and identifying local and global communication.
Agglomeration: the process of grouping tasks into larger tasks in order to improve performance, simplify programming and lower communication overhead.
Mapping: the process of assigning tasks to processors.
Message
Most programming problems have several parallel solutions.
The best solution may differ from that suggested by existing sequential algorithms.
You should not parallelize a code (most times you end up paralyzing it!). The correct approach is to develop the parallel algorithm.
Part IV
Computer Architecture and Efficiency
Introduction
In order to produce "fast" code one needs to understand modern computer hardware.
Crucial Concept: Memory Hierarchy.
Advice: Keep simple, intelligible (unoptimized) code in comments for future reference.
Memory Hierarchy
Caches
The fastest memory is the registers. Operations act directly on their values. They can be accessed in one clock cycle.
Cache is inside the CPU, so it is limited and fixed in size.
Caches are broken up into cache lines, which are short blocks of memory that copy memory ranges from main memory.
Blocks of memory in cache are indexed by a table in hardware.
Memory accesses from the CPU can be cache hits or cache misses.
It is good programming practice to use memory sequentially to be able to reuse cache lines and avoid cache misses.
Virtual Memory
Virtual memory is the usage of storage devices to increase the amount of available memory.
Blocks of virtual memory are called pages.
When memory is accessed, a table is looked at to find its physical address. If that address is not in main memory, a page is read from storage and the table updated; this is called a page fault.
Generating many page faults is called thrashing.
It is good programming practice to use memory sequentially to be able to reuse pages and avoid page faults.
Designing for memory hierarchies
Introduction
Main principle: Locality
Temporal locality - executable code
Spatial locality - data
For good temporal locality code should avoid jumping around too much. Most times a jump or a branch to a far-away address occurs, cache lines need to be filled with new instructions.
For good spatial locality code should use memory sequentially.
Inline small functions
Consider the codes:

Direct version:
    double x[N], y[N], z[N];
    ...
    for( i=0 ; i<N ; ++i )
        z[i]=x[i]+y[i];

Function-call version:
    double d_add(double a, double b)
    { return a+b; }
    ...
    for( i=0 ; i<N ; ++i )
        z[i]=d_add(x[i],y[i]);
Function calls involve setting up, and later destroying, a stack frame.
Function calls usually involve branching to somewhere far away in memory.
Variables usually have to be saved from registers to memory and vice-versa.
If you really need code like this, consider inlining small functions, as in the sketch below.
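As a rough illustration (not from the original slides), in C the helper can be marked inline so the compiler may substitute its body at the call site and remove the call overhead:

    /* minimal sketch: an inlinable helper the compiler can expand in place */
    static inline double d_add(double a, double b) { return a + b; }

    void add_arrays(int n, const double *x, const double *y, double *z)
    {
        for (int i = 0; i < n; ++i)
            z[i] = d_add(x[i], y[i]);   /* typically compiled to z[i] = x[i] + y[i] */
    }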
if-then-else
The if-then-else construct tends to reduce temporal locality. Avoid it if possible in crucial sections of the code, like for loops, as it reduces locality and makes it much more difficult for compilers to optimize the code. A sketch of hoisting a branch out of a hot loop follows.
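A hedged sketch of the idea (names are illustrative): moving a loop-invariant test out of the loop keeps the hot loop body branch-free.

    void scale_or_copy(int n, int use_scale, double s, const double *x, double *y)
    {
        if (use_scale) {                              /* decide once, outside the loop */
            for (int i = 0; i < n; ++i) y[i] = s * x[i];
        } else {
            for (int i = 0; i < n; ++i) y[i] = x[i];
        }
    }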
Compact loops
Keep your loops as compact as possible to achieve better locality and improve the chance of compiler optimizations.
Loop index order
Always order your loops according to the row-wise or column-wise array ordering of your language.
C/C++ (row-wise ordering):
    for( i=0 ; i<m ; ++i )
        for( j=0 ; j<n ; ++j )
            y[i]=y[i]+a[i][j]*x[j];

Fortran (column-wise ordering):
    do j=1,n
        do i=1,m
            y(i)=y(i)+a(i,j)*x(j)
        enddo
    enddo
Blocking
In loops with multiple arrays it is often impossible to achieve good spatial locality just by loop index reordering. This is the case if both rows and columns are used simultaneously. A common practice is to operate on blocks of the matrices (sub-matrices). The goal is to maximize access to the data loaded into cache before it gets replaced. This technique is called blocking; a sketch follows.
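A minimal sketch of blocking in C (the block size B is assumed to fit in cache and, for brevity, to divide n):

    void matmul_blocked(int n, int B, const double *A, const double *Bmat, double *C)
    {
        for (int ii = 0; ii < n; ii += B)
            for (int jj = 0; jj < n; jj += B)
                for (int kk = 0; kk < n; kk += B)
                    /* multiply one B x B sub-matrix pair while it is resident in cache */
                    for (int i = ii; i < ii + B; ++i)
                        for (int j = jj; j < jj + B; ++j) {
                            double s = C[i*n + j];
                            for (int k = kk; k < kk + B; ++k)
                                s += A[i*n + k] * Bmat[k*n + j];
                            C[i*n + j] = s;
                        }
    }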
Block reordering
Suppose we want to multiply two large n x n matrices and we use a block algorithm:
\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
=
\begin{pmatrix} A_{11}B_{11}+A_{12}B_{21} & A_{11}B_{12}+A_{12}B_{22} \\ A_{21}B_{11}+A_{22}B_{21} & A_{21}B_{12}+A_{22}B_{22} \end{pmatrix}
\]
This algorithm can be applied recursively such that every block fits in the available cache. In this case, instead of storing the matrices in either row-wise or column-wise order, we can store them in a way that makes the algorithm particularly convenient: to store A we first store A11, then A12, followed by A21 and A22. Each sub-matrix should be recursively stored in the same fashion.
Block reordering
For an 8x8 matrix we would have the following storage order:
     1  2  5  6 17 18 21 22
     3  4  7  8 19 20 23 24
     9 10 13 14 25 26 29 30
    11 12 15 16 27 28 31 32
    33 34 37 38 49 50 53 54
    35 36 39 40 51 52 55 56
    41 42 45 46 57 58 61 62
    43 44 47 48 59 60 63 64
Such an ordering produces a cache-oblivious algorithm.
Block reordering
Cache-oblivious algorithms have, however, several issues that need to be taken into account:
Computing the address of a specific entry is much more complex than for standard matrix layouts.
The natural algorithms are recursive, which is difficult for compilers to optimize, and recursive routines cannot be inlined.
The natural base cases for the recursion are 1x1 or 2x2 block matrices, which are too small for modern pipelined CPUs.
Dynamic data structures
Dynamic data structures, like lists and trees, are commonly used in advanced algorithms.
Creation and destruction of nodes may destroy memory locality.
How can we create dynamic data structures and still keep memory locality?
Copy the data structure after creation to force sequential node creation before any critical calculation. Use allocation/deallocation routines that force memory locality (a sketch follows).
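As an illustration (a sketch, not the slides' code): allocating list nodes from one contiguous pool keeps neighbouring nodes close in memory.

    #include <stdlib.h>

    struct node { double value; struct node *next; };

    /* build an n-node list from a single contiguous allocation */
    struct node *make_pool_list(int n)
    {
        struct node *pool = malloc(n * sizeof *pool);
        if (!pool) return NULL;
        for (int i = 0; i < n; ++i) {
            pool[i].value = 0.0;
            pool[i].next  = (i + 1 < n) ? &pool[i + 1] : NULL;
        }
        return pool;   /* traversal now walks sequentially through memory */
    }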
Loop unrolling
Consider the codes:

Original version:
    sum=0.0;
    for(i=0;i<n;++i)
        sum=sum+a[i]*b[i];

Unrolled version:
    sum0=sum1=sum2=sum3=0.0;
    for(i=0;i<4*(n/4);i+=4)
    {
        sum0=sum0+a[i]*b[i];
        sum1=sum1+a[i+1]*b[i+1];
        sum2=sum2+a[i+2]*b[i+2];
        sum3=sum3+a[i+3]*b[i+3];
    }
    for(i=4*(n/4);i<n;++i)
        sum0=sum0+a[i]*b[i];
Most CPUs use pipelined units to perform these operations. In the first version each multiply-add has to clear the pipeline to be added to "sum". In the second version the pipeline is kept full and more multiply-adds are performed. This technique is called loop unrolling. How much loop unrolling is needed depends on the details of the architecture and the effectiveness of the compiler optimizations.
Library usage
In any algorithm performance hotspot NEVER, EVER use your own routines for numerical operations.
ALWAYS use libraries.
ALWAYS choose CPU optimized libraries if available.
Typical - linear algebra:
Manufacturer version
ATLAS or GOTO
operating system version
your own version
(A sketch of calling an optimized BLAS follows.)
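A sketch of what this looks like in practice, assuming an optimized BLAS with the CBLAS interface (for instance ATLAS or a vendor library) is installed: C = A*B via dgemm instead of a hand-written triple loop.

    #include <cblas.h>

    void multiply(int n, const double *A, const double *B, double *C)
    {
        /* C := 1.0*A*B + 0.0*C, all matrices n x n in row-major order */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A, n, B, n,
                    0.0, C, n);
    }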
Part V
Performance Analysis
Introduction
Being able to accurately predict the performance of a parallel algorithm you have designed can help you decide whether to actually go to the trouble of coding and debugging it.
Being able to analyze the execution time exhibited by a parallel program can help you understand the barriers to higher performance and predict how much improvement can be realized by increasing the number of processors.
Speedup and Efficiency
\[ \text{Speedup} = \frac{\text{Sequential execution time}}{\text{Parallel execution time}} = \psi(n,p) \]
\[ \text{Efficiency} = \frac{\text{Speedup}}{\#\text{ Processors}} = \varepsilon(n,p) \]
Speedup and Efficiency
Time in parallel algorithms is used in:
Sequential execution: σ(n)
Parallel execution: ϕ(n)
Parallel overhead: κ(n, p)
Speedup and efficiency can hence be written as:
\[ \psi(n,p) \le \frac{\sigma(n)+\varphi(n)}{\sigma(n)+\frac{\varphi(n)}{p}+\kappa(n,p)} \qquad
\varepsilon(n,p) \le \frac{\sigma(n)+\varphi(n)}{p\,\sigma(n)+\varphi(n)+p\,\kappa(n,p)} \]
Amdahl’s law
Consider the expression for the speedup ψ(n, p). Since κ(n, p) ≥ 0 we have:
\[ \psi(n,p) \le \frac{\sigma(n)+\varphi(n)}{\sigma(n)+\frac{\varphi(n)}{p}+\kappa(n,p)} \le \frac{\sigma(n)+\varphi(n)}{\sigma(n)+\frac{\varphi(n)}{p}} \]
Let f denote the sequential portion of the computation:
\[ f = \frac{\sigma(n)}{\sigma(n)+\varphi(n)} \]
Substituting we obtain:
\[ \psi(n,p) \le \frac{1}{f + \frac{1-f}{p}} \]
Amdahl’s law
Amdahl's Law: Let f be the fraction of operations in a computation that must be performed sequentially. The maximum speedup ψ achievable by a parallel computer with p processors is:
\[ \psi(n,p) \le \frac{1}{f + \frac{1-f}{p}} \]
Corollary: The maximum achievable speedup by a parallel algorithm with a fraction f of its code performed sequentially is:
\[ \lim_{p \to +\infty} \psi(n,p) \le \frac{1}{f} \]
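A small worked example (numbers chosen for illustration): with f = 0.1 and p = 16,
\[ \psi(n,16) \le \frac{1}{0.1 + \frac{0.9}{16}} = \frac{1}{0.15625} = 6.4, \]
and no matter how many processors are added, the speedup can never exceed 1/f = 10.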
Amdahl’s Effect
Typically κ(n, p) has lower complexity than ϕ(n). Increasing the size of the problem increases the computation time faster than it increases the parallel overhead. Hence, for a fixed number of processes, speedup is usually an increasing function of problem size. This is known as Amdahl's effect.
Gustafson-Barsis’ Law
Consider the expression for the speedup ψ(n, p). Since κ(n, p) ≥ 0 we have:
\[ \psi(n,p) \le \frac{\sigma(n)+\varphi(n)}{\sigma(n)+\frac{\varphi(n)}{p}+\kappa(n,p)} \le \frac{\sigma(n)+\varphi(n)}{\sigma(n)+\frac{\varphi(n)}{p}} \]
Let s denote the fraction of time spent in the parallel execution performing sequential operations:
\[ s = \frac{\sigma(n)}{\sigma(n)+\frac{\varphi(n)}{p}} \]
Substituting we obtain:
\[ \psi(n,p) \le p + (1-p)\,s \]
Gustafson-Barsis’ Law
Gustafson-Barsis' Law: Given a parallel program solving a problem of size n using p processors, let s denote the fraction of total execution time spent in serial code. The maximum speedup (usually called scaled speedup) ψ achievable by this program is:
\[ \psi(n,p) \le p + (1-p)\,s \]
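A small worked example (illustrative numbers): if a run on p = 64 processors spends s = 0.05 of its time in serial code, the scaled speedup is bounded by
\[ \psi(n,64) \le 64 + (1-64)(0.05) = 64 - 3.15 = 60.85. \]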
Summary on speedup limits
Amdahl's law determines speedup by taking the serial computation and predicting how quickly that computation could execute on multiple processors.
Gustafson-Barsis' Law begins with a parallel computation and estimates how much faster the parallel computation is than the same computation executing on a single processor.
Both laws ignore parallel overhead.
The Karp-Flatt metric
We define the experimentally determined serial fraction e of a parallel computation as:
\[ T(n,p) = T(n,1)\,e + \frac{T(n,1)}{p}\,(1-e) \]
The Karp-Flatt metric: Given a parallel computation exhibiting speedup ψ on p processors, the experimentally determined serial fraction e is:
\[ e = \frac{\frac{1}{\psi} - \frac{1}{p}}{1 - \frac{1}{p}} \]
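A small worked example (illustrative numbers): a speedup of ψ = 6.4 measured on p = 8 processors gives
\[ e = \frac{\frac{1}{6.4} - \frac{1}{8}}{1 - \frac{1}{8}} = \frac{0.15625 - 0.125}{0.875} \approx 0.036, \]
that is, about 3.6% of the execution behaves as serial work plus overhead.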
The Karp-Flatt metric
The Karp-Flatt metric is useful because it:
takes into account parallel overhead;
helps detect sources of overhead and inefficiency.
Part VI
OpenMP
Introduction
Definition
OpenMP is a parallel programming model for shared memory and distributed shared memory multiprocessors.
OpenMP is not a computer language. It can be used with FORTRAN and C/C++.
OpenMP requires a compliant compiler.
Advantages and Disadvantages of OpenMP
Advantages
Easier to learn.
Parallelization can be incremental.
Coarse-grain and fine-grain parallelization.
Widely available.
Portable.
Disadvantages
Limited scalability.
Impossible to use directly on distributed memory systems.
Components
[Figure: OpenMP components - User, Application, Compiler Directives, Environment Variables, Runtime Libraries, and Threads in the Operating System]
Components
Compiler Directives.
Run-time libraries.
Environment variables.
Execution Model: The fork/join model
Parallel regions are the building “blocks” within the code.
A master thread is started at runtime and persists throughout execution.
The master thread starts the team of threads at parallel regions.
[Figure: the master thread forks a team of threads at each parallel region and joins them at the end]
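A minimal fork/join sketch in C (assumes an OpenMP-aware compiler):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("serial region: master thread only\n");
        #pragma omp parallel                 /* fork: a team of threads is created */
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                    /* join: back to the master thread */
        printf("serial region again\n");
        return 0;
    }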
Thread Communication
Every thread has access to global memory - shared memory.
Every thread has access to stack memory - private memory.
Code should use shared memory to communicate between threads.
Simultaneous updates to shared memory can create a race condition: results change with different thread scheduling.
Use mutual exclusion to avoid data races, but don't use too much because it will serialize performance (see the reduction sketch below).
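As a sketch of how a race on a shared accumulator is usually avoided, a reduction gives each thread a private partial sum that OpenMP combines safely at the end:

    double dot(int n, const double *a, const double *b)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += a[i] * b[i];   /* without the reduction clause this update would race */
        return sum;
    }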
OpenMP
Syntax and Sentinels
Parallel Control Directives
OpenMP provides two kinds of directives to control parallelism:
To create multiple threads.
To divide work among an existing set of parallel threads (both kinds are sketched below).
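A minimal sketch showing the two kinds together:

    void saxpy(int n, double a, const double *x, double *y)
    {
        #pragma omp parallel          /* kind 1: creates the team of threads */
        {
            #pragma omp for           /* kind 2: divides the iterations among that team */
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }
    }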
Combined Directives
Some directives can, however, be combined together, as in the sketch below:
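For example, the parallel and for directives from the previous sketch collapse into a single combined directive:

    void saxpy_combined(int n, double a, const double *x, double *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }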
Data Environment
Only the master thread has a data address space that lasts for the entire duration of the run.
During the fork operation OpenMP chooses to either share a single copy between all the threads or provide each thread with its own private copy for the duration of the parallel construct.
There are defaults for choosing if a variable is shared or private.
The user is advised to always specify the character of the variable.
Data Environment
Default behavior can be overridden with data scoping clauses:
FORTRAN
!$OMP PARALLEL DEFAULT(NONE) &
!$OMP SHARED(...) PRIVATE(...) REDUCTION(...)
C/C++
#pragma omp parallel default(none) \
shared(...) private(...) reduction(...)
Initialization of private variables can also be controlled with firstprivate.
The return value of private variables can also be controlled with lastprivate.
Data Environment: Reductions
An "operation" that combines multiple elements to form a single result is called a reduction operation.
The variable that accumulates the result is called the reduction variable.
Reduction operators and variables must be declared.
sum=0
!$omp parallel do reduction(+ : sum)
do i = 1, n
sum = sum + a(i)
enddo
Directive Clauses
Clauses control the behavior of OpenMP directives.
Part VII
More OpenMP
Synchronisation
The term synchronization refers to the mechanism by which a parallel program can coordinate the execution of multiple threads. The two most common forms of synchronization are:
Mutual exclusion: a way to control access to a shared variable - the critical/atomic directives.
Event synchronization: a way to ensure that threads have all reached a particular point in execution - the barrier directive. (A small sketch follows.)
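A minimal sketch combining both forms (illustrative, not the slides' example):

    void sum_all(int n, const double *a, double *total)
    {
        *total = 0.0;
        #pragma omp parallel
        {
            double local = 0.0;
            #pragma omp for
            for (int i = 0; i < n; ++i)
                local += a[i];

            #pragma omp critical      /* mutual exclusion around the shared update */
            *total += local;

            #pragma omp barrier       /* event synchronization: every thread waits here */
            /* from here on every thread sees the completed *total */
        }
    }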
Runtime libraries
Environment variables
Compiling and Running an OpenMP programme
Compiling an OpenMP program
Compiling an OpenMP program depends on the compiler:
Intel Fortran: ifort -openmp input file
Pathscale Fortran: pathf90 -mp input file
GCC: gfortran -fopenmp input file (version > 4.2)
Running an OpenMP program
maolive@phorcys:~> pathf90 -mp hello3.f90 -o hello3
maolive@phorcys:~> export OMP_NUM_THREADS=5
maolive@phorcys:~> ./hello3
Hello world from thread 1 !
Hello world from thread 0 !
Hello world from thread 4 !
Hello world from thread 3 !
Hello world from thread 2 !
Part VIII
Introduction to MPI
MPI
Definition
MPI is a parallel programming model for message passing multiprocessing.
It consists of a set of library calls to enable multiple processes to communicate.
In MPI all instances of your parallel application are started at launch and have private address spaces.
MPI is not a computer language. It can be used with FORTRAN and C/C++.
Advantages and Disadvantages
Advantages
Scalability.
Compatible even with shared memory systems.
Widely available.
Portable.
Disadvantages
Harder to learn.
Non-incremental parallelization.
Fundamentals
MPI is based on the message passing paradigm with a minimal interface such as:
send(address, count, datatype, destination, tag, comm)
receive(address, maxcount, datatype, source, tag, comm)
Messages can be either point-to-point or collective and can be blocking or non-blocking.
Processes belong to a group. In each group a process is identified by its rank.
Each group of processes is characterized by an object called a communicator.
Basic MPI
Basic MPI Tasks
Initialization and Termination
Setting up Communicators
Point to Point Communication
Collective Communication
Basic Routines
MPI_Init: Initialize MPI
MPI_Comm_size: Find out how many processes there are
MPI_Comm_rank: Find out which process I am
MPI_Send: Send a message
MPI_Recv: Receive a message
MPI_Finalize: Terminate MPI
MPI_Bcast: Broadcast a message
MPI_Reduce: Reduce a message
MPI_Barrier: Blocks until all processes reach the barrier
MPI_Abort: Terminates MPI with an error code
Beyond Basic?
All other functions just add:
flexibility (datatypes),
robustness (non-blocking send/receive),
efficiency,
modularity (groups, communicators),
convenience (collective operations, topologies).
Example
Initialization and Termination
program param
include 'mpif.h'
call mpi_init(ierr)
call mpi_comm_size(MPI_COMM_WORLD,np,ierr)
call mpi_comm_rank(MPI_COMM_WORLD,mype,ierr)
...
do_something_useful
...
call mpi_finalize(ierr)
end program
Compiling and Running an MPI program
Compiling
Generally use a special compiler or compiler wrapper script:
not defined by the standard
consult your implementation
handles correct include path, library path, and libraries
MPICH-style (the most common):
C: mpicc -o foo foo.c
Fortran: mpif77 -o foo foo.f
Running
MPI programs require some help to get started
what computers should I run on?
how do I access them?
MPICH-style
mpirun -np 10 -machinefile mach ./foo
When batch schedulers are involved, all bets are off
Part IX
MPI
Collective Communication
Collective Communications
Involves all processes within a communicator
There are three basic types:
Synchronization
Data movement
Data movement with computation
MPI Collectives are blocking
MPI Collectives do not use message tags
Synchronization: Barrier
A barrier is a synchronization primitive. A node calling it will block until all the nodes within the group have called it.
MPI_BARRIER(COMM, IERR)
Data Movement: Broadcast
MPI provides the broadcast primitive to send data to all communicator members:
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERR)
It must be called by each node in the group with the same COMM and ROOT.
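A sketch in C (the value set on the root is hypothetical):

    #include <mpi.h>

    void share_parameter(double *value)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            *value = 3.14;                        /* e.g. read from an input file on the root */
        MPI_Bcast(value, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        /* after the call every rank holds the same value */
    }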
Data Movement: Scatter, Gather
Gather and scatter are inverse operations. Gather collects data from every member of the group on the root node, in linear order by the rank of the node. Scatter parcels out data from the root to every member of the group, in linear order by node.
MPI_GATHER(SNDBUF, SCOUNT, DATATYPE, RECVBUF, RCOUNT, RDATATYPE, ROOT, COMM, IERR)
MPI_SCATTER(SNDBUF, SCOUNT, DATATYPE, RECVBUF, RCOUNT, RDATATYPE, ROOT, COMM, IERR)
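A C sketch (assumes the array length is divisible by the number of ranks; the local work is illustrative):

    #include <mpi.h>

    void scatter_work_gather(int n, const double *input, double *output)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = n / size;
        double local[chunk];                      /* variable-length array for brevity */

        MPI_Scatter(input, chunk, MPI_DOUBLE,
                    local, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        for (int i = 0; i < chunk; ++i)
            local[i] *= 2.0;                      /* each rank works on its own chunk */
        MPI_Gather(local, chunk, MPI_DOUBLE,
                   output, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }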
Data Movement: All Gather
Provides a more efficient way to do a gather followed by a broadcast: all members of the group receive the collected data.
MPI_ALLGATHER(SNDBUF, SCOUNT, SDATATYPE, RECVBUF, RCOUNT, RDATATYPE, COMM, IERR)
Data Movement: All to All
The jth block sent from node i is received by node j and is placed in the ith block. This is typically useful for implementing the Fast Fourier Transform.
MPI_ALLTOALL(SNDBUF, SCOUNT, SDATATYPE, RECVBUF, RCOUNT, RDATATYPE, COMM, IERR)
Advanced Data Movement Primitives
MPI_{Scatter,Gather,Allgather,Alltoall}v: the v stands for varying size and relative location of messages.
Advantages
flexibility
less need to copy data into temporary buffers
more compact
Disadvantages
Harder to program - more bookkeeping
Generalized Data Movement Primitives
Scatter vs Scatterv
Data movement with computation: Reduce
A global reduction combines partial results from each node in the group using some basic function and returns the answer to the root node:
MPI_REDUCE(SNDBUF, RECVBUF, COUNT, DATATYPE, OPERATOR, ROOT, COMM, IERR)
Data movement with computation: AllReduce
A variant of the reduce operation where the result is returned to all processes in the group.
MPI_ALLREDUCE(SNDBUF, RECVBUF, COUNT, DATATYPE, OPERATOR, COMM, IERR)
Data movement with computation: Scan
A scan performs partial reductions on distributed data. Specifically, the scan operation returns the reduction of the values in the send buffers of processes ranked 0, 1, ..., n into the receive buffer of the node ranked n.
MPI_SCAN(SNDBUF, RECVBUF, COUNT, DATATYPE, OPERATOR, COMM, IERR)
Data movement with computation: The operations
MPI_SUM: sum
MPI_PROD: product
MPI_MAX: maximum value
MPI_MIN: minimum value
MPI_MAXLOC: max. value location
MPI_MINLOC: min. value location
MPI_LAND: logical and
MPI_LOR: logical or
MPI_LXOR: logical xor
MPI_BAND: bitwise and
MPI_BOR: bitwise or
MPI_BXOR: bitwise xor
Table: Common predefined operations for MPI collectives
Point to Point Communication
Deadlock
Deadlock is a phenomenon most common with blocking communication. It occurs when all tasks are waiting for events that haven't been initiated yet.
Communication Behavior
Point to point communication functions have two distinctbehaviors: blocking and non-blocking.
Blocking
Execution is suspended until the message buffer is safe to reuse.
Non-blocking
Execution is not suspended. The user is responsible for testing or waiting until the message buffer is safe to reuse.
Communication Modes
For each communication behavior there are four modes:
Standard
Implementation dependent. Should be the best compromise between reliability and performance.
Synchronous
The sending task sends the receiver a "ready to send" message. When the receiver replies with "ready to receive", data is transferred.
Ready
When called, a "ready to receive" message has to have arrived, otherwise an error is generated.
Buffered
Copies the message to a user-supplied buffer and returns.
Unidirectional Point to point Communication
Mode: Blocking / Non-blocking
Standard: MPI_Send / MPI_Isend
Synchronous: MPI_Ssend / MPI_Issend
Ready: MPI_Rsend / MPI_Irsend
Buffered: MPI_Bsend / MPI_Ibsend
Receive (all modes): MPI_Recv / MPI_Irecv
Blocking Standard Mode
Blocking Synchronous Mode
Blocking Ready Mode
Blocking Buffered Mode
Non-Blocking Standard Mode
Point to point Communications
Ready mode has the least total overhead. However, the assumption is that the receive is already posted.
Synchronous mode is portable and "safe". It does not depend on order (ready) or buffer space (buffered). However, it incurs substantial overhead.
Buffered mode decouples the sender from the receiver. There is no synchronization overhead on the sending task and the order of execution does not matter (ready). The user can control the size of the message buffers and the total amount of space. However, additional overhead may be incurred by the copy to the buffer, and buffer space can run out.
Standard mode is implementation dependent. Small messages are generally buffered (avoiding synchronization overhead) and large messages are usually sent synchronously (avoiding the required buffer space).
To block or not to block?!
This code worked on one machine but does not work in general. Why?

    ! SEND DATA
    LM=6*NES+2
    DO I=1,NUMPRC
        NT=I-1
        IF (NT.NE.MYPRC) THEN
            print *,myprc,'send',msgtag,'to',nt
            CALL MPI_SEND(NWS,LM,MPI_INTEGER,NT,MSGTAG, &
                          MPI_COMM_WORLD,IERR)
        ENDIF
    END DO
    ! RECEIVE DATA
    LM=6*100+2
    DO I=2,NUMPRC
        CALL MPI_RECV(NWS,LM,MPI_INTEGER, &
                      MPI_ANY_SOURCE,MSGTAG,MPI_COMM_WORLD,STATUS,IERR)
        ! do something with data
    END DO
Bidirectional point to point communications
The MPI_SENDRECV function provides an efficient way to handle a common situation where a processor needs to send data to another processor and receive data from the same processor, or a different one. Calling MPI_SENDRECV is similar to calling MPI_SEND followed by a call to MPI_RECV.
MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status)
MPI_SENDRECV_REPLACE is a version of the previous primitive that allows the send buffer and receive buffer to be the same, which in effect replaces the contents of the send buffer by that of the received buffer.
MPI_SENDRECV_REPLACE(buf, count, datatype, dest, sendtag, source, recvtag, comm, status, ierr)
Avoiding Deadlock
different ordering of calls
non-blocking calls
bidirectional calls
buffered mode (a non-blocking sketch follows)
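A sketch of the non-blocking approach for a ring exchange, where blocking sends could otherwise deadlock:

    #include <mpi.h>

    void ring_exchange(const double *sendbuf, double *recvbuf, int count)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        MPI_Request reqs[2];
        MPI_Irecv(recvbuf, count, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, count, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }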
Part X
More MPI
Derived Datatypes
Basic Types
MPI Datatype: Fortran Datatype
MPI_BYTE: (no corresponding Fortran datatype)
MPI_CHARACTER: CHARACTER
MPI_COMPLEX: COMPLEX
MPI_DOUBLE_PRECISION: DOUBLE PRECISION
MPI_INTEGER: INTEGER
MPI_LOGICAL: LOGICAL
MPI_PACKED: (no corresponding Fortran datatype)
MPI_REAL: REAL
Table: Basic predefined datatypes for Fortran
Derived Datatypes
MPI basic datatypes are predefined for contiguous data of a single type.
What if the application has data of mixed types, or non-contiguous data?
Existing solutions (multiple calls, or copying into a buffer and packing) are slow, clumsy and waste memory; a better solution is creating/deriving datatypes for these special needs from existing datatypes.
Derived datatypes can be created recursively and at run-time.
Automatic packing and unpacking.
Derived Datatypes
Elementary: Language-defined types
Contiguous: Vector with stride of one
Vector: Separated by constant “stride”
Hvector: Vector, with stride in bytes
Indexed: Array of indices (for scatter/gather)
Hindexed: Indexed, with indices in bytes
Struct: General mixed types (for C structs etc.)
Derived Datatypes
! contiguous datatype: count consecutive elements of "datatype"
call MPI_Type_contiguous(count, datatype, newdatatype, ierr)
call MPI_Type_commit(newdatatype, ierr)
...
call MPI_Type_free(newdatatype, ierr)

! vector datatype: one row of an nrows x ncols array (stride nrows)
call MPI_Type_vector(ncols, 1, nrows, MPI_DOUBLE_PRECISION, vtype, ierr)
call MPI_Type_commit(vtype, ierr)
...
call MPI_Send( A(nrows,1), 1, vtype ...)
...
call MPI_Type_free(vtype, ierr)
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Communicators and Groups
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Communicators and Groups
All MPI communication is relative to a communicator, which contains a context and a group. The group is just a set of processes.
[Figure: the processes of MPI_COMM_WORLD partitioned into two sub-communicators, COMM_1 and COMM_2, each with its own local ranks.]
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Communicators and Groups
To handle communicators and groups we can usually deploy two strategies:
Group manipulation - More general.
Direct communicator manipulation - Usually more compact and suitable for regular decompositions.
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Communicators and Groups
First strategy
MPI_Init(&argc, &argv);
world = MPI_COMM_WORLD;
MPI_Comm_size(world, &numprocs);
MPI_Comm_rank(world, &myid);
server = numprocs - 1;                 /* last rank acts as the server */
MPI_Comm_group(world, &world_group);   /* group of all processes       */
ranks[0] = server;
/* new group = world group minus the server rank */
MPI_Group_excl(world_group, 1, ranks, &worker_group);
/* communicator containing only the workers */
MPI_Comm_create(world, worker_group, &workers);
MPI_Group_free(&worker_group);
MPI_Group_free(&world_group);
...
MPI_Comm_free(&workers);
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Communicators and Groups
Second strategy
/* color is 1 on the server, 0 on every worker: MPI_Comm_split puts
   processes with the same color into the same new communicator */
color = (myid == server);
MPI_Comm_split(world, color, 0, &workers);
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Topologies
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Topologies
Another way to use communicators is to organize tasks on a given topology.
Cartesian topologies are predefined and allow MPI tasks to be laid out on a Cartesian coordinate grid.
[Figure: a 4 x 3 Cartesian grid of MPI tasks labelled by their coordinates, from (0,0) to (3,2).]
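A minimal C sketch of creating such a grid, assuming the 4 x 3 layout of the figure (variable names are illustrative; run with 12 MPI tasks):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm grid_comm;
    int dims[2]    = {4, 3};   /* 4 x 3 process grid        */
    int periods[2] = {0, 0};   /* non-periodic in both dims */
    int coords[2], rank;

    MPI_Init(&argc, &argv);

    /* Attach Cartesian topology information to a new communicator;
       reorder = 1 lets MPI renumber ranks to match the hardware. */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid_comm);

    /* Each task can then ask for its own grid coordinates. */
    MPI_Comm_rank(grid_comm, &rank);
    MPI_Cart_coords(grid_comm, rank, 2, coords);
    printf("rank %d is at (%d,%d)\n", rank, coords[0], coords[1]);

    MPI_Comm_free(&grid_comm);
    MPI_Finalize();
    return 0;
}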
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Wildcards
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Wildcards
Enables the programmer to avoid having to specify a tag and/or a source:
MPI_ANY_TAG, MPI_ANY_SOURCE
Enables the programmer to keep algorithms general: MPI_PROC_NULL
can be used in send and/or receive; the operation completes immediately; no communication is involved
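A hedged sketch of how MPI_PROC_NULL keeps a halo exchange general; the 1-D decomposition and names are assumptions:

#include <mpi.h>

/* Exchange boundary values with the ranks above and below in a 1-D
   decomposition.  The first and last ranks have no neighbour on one
   side; using MPI_PROC_NULL there turns those transfers into no-ops,
   so the same code works on every rank. */
void halo_exchange(double boundary_lo, double boundary_hi,
                   double *ghost_lo, double *ghost_hi,
                   int rank, int size)
{
    int below = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int above = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my low boundary down, receive my high ghost from above */
    MPI_Sendrecv(&boundary_lo, 1, MPI_DOUBLE, below, 0,
                 ghost_hi,     1, MPI_DOUBLE, above, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send my high boundary up, receive my low ghost from below */
    MPI_Sendrecv(&boundary_hi, 1, MPI_DOUBLE, above, 1,
                 ghost_lo,     1, MPI_DOUBLE, below, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}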
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Timing
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Timing
...
call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! synchronize before starting the clock
t = MPI_Wtime()
...                                      ! timed section
call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! make sure every rank has finished
t = MPI_Wtime() - t                      ! elapsed wall-clock time in seconds
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Introduction
I/O types inMPI programs
Parallel MPII/O to a singlefile
MPI I/O
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Introduction
I/O types inMPI programs
Parallel MPII/O to a singlefile
Introduction
Programming languages have predefined functions to handle files.
Typical operations are open, close, read and write.
MPI supports counterparts.
These MPI routines can conveniently express parallelism.
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Introduction
I/O types inMPI programs
Parallel MPII/O to a singlefile
I/O types in MPI programs
Non-parallel I/O
I/O to separate files
Parallel I/O
AdvancedComputing
M.A. Oliveira
Datatypes
Comm &Groups
Topologies
Wildcards
Timing
MPI I/O
Introduction
I/O types inMPI programs
Parallel MPII/O to a singlefile
Parallel MPI I/O to a single file
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100
int main(int argc, char *argv[])
{
int i, myrank, buf[BUFSIZE];
MPI_File thefile;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
for (i=0; i<BUFSIZE; i++)
buf[i] = myrank * BUFSIZE + i;
MPI_File_open(MPI_COMM_WORLD, "testfile",
MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &thefile);
MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&thefile);
MPI_Finalize();
return 0;
}
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Part XI
Hybrid Programming
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Hybrid Programming
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Distributed and Shared Memory Systems
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Distributed and Shared Memory Systems
Combines distributed memory parallelization with on-node shared memory parallelization.
The largest systems now employ both architectures.
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Why Hybrid Computing
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Why Hybrid Computing
Eliminates domain decomposition at node
Automatic coherency at node
Lower memory latency and data movement within node
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Modes of Hybrid Computing
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Modes of Hybrid Computing - Tasks and Threads
Fixing the execution to a particular processing unit and memory bank can speed up execution. Consider using "numactl"...
Running hybrid codes on a queuing system requires special configuration or skillful scripting.
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Hybrid Coding
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Modes of Hybrid Computing - Types
Support Level           Description
MPI_THREAD_SINGLE       Only one thread will execute.
MPI_THREAD_FUNNELED     Process may be multithreaded, but only the main thread can make MPI calls. Default mode.
MPI_THREAD_SERIALIZED   Process may be multithreaded, and any thread can make MPI calls, but threads cannot execute MPI calls concurrently.
MPI_THREAD_MULTIPLE     Multiple threads may call MPI. No restrictions.
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
MPI Initialization
Fortran: call MPI_Init_thread(required, provided, ierr)
C: int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
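A minimal C sketch of requesting and checking a thread support level (error handling simplified; not from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;

    /* ask for the funneled level: only the main thread will call MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not support MPI_THREAD_FUNNELED\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... hybrid MPI + OpenMP work goes here ... */

    MPI_Finalize();
    return 0;
}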
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Funneled Mode
MPI_THREAD_FUNNELED
Use OMP BARRIER since there is no implicit barrier in the master construct (OMP MASTER).
All other threads will be sleeping.
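A hedged sketch of the funneled pattern (the exchanged buffers and neighbours are placeholders):

#include <mpi.h>
#include <omp.h>

/* Funneled mode: all threads compute, but only the master thread makes
   the MPI call (MPI_THREAD_FUNNELED is sufficient). */
void compute_and_exchange(double *sendbuf, double *recvbuf, int n,
                          int left, int right)
{
    #pragma omp parallel
    {
        /* ... threaded computation filling sendbuf ... */

        #pragma omp barrier    /* all threads must be done before we send */
        #pragma omp master
        {
            MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, right, 0,
                         recvbuf, n, MPI_DOUBLE, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        #pragma omp barrier    /* OMP MASTER has no implicit barrier */

        /* ... threaded computation using recvbuf ... */
    }
}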
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Funneled Mode
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Serialized Mode
MPI_THREAD_SERIALIZED
Only OMP BARRIER at the beginning, since there is an implicit barrier in the SINGLE work-share construct (OMP SINGLE).
All other threads will be sleeping.
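A hedged sketch of the serialized pattern (placeholder names; any one thread may end up in the SINGLE region, so MPI_THREAD_SERIALIZED is needed):

#include <mpi.h>
#include <omp.h>

/* Serialized mode: any one thread may make the MPI call, but never two
   at the same time.  OMP SINGLE picks one thread, and its implicit
   barrier at the end keeps the others waiting. */
void compute_and_exchange(double *sendbuf, double *recvbuf, int n,
                          int left, int right)
{
    #pragma omp parallel
    {
        /* ... threaded computation filling sendbuf ... */

        #pragma omp barrier    /* only an explicit barrier is needed here */
        #pragma omp single
        {
            MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, right, 0,
                         recvbuf, n, MPI_DOUBLE, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }   /* implicit barrier at the end of OMP SINGLE */

        /* ... threaded computation using recvbuf ... */
    }
}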
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Serialized Mode
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Multi-Threaded Mode
MPI_THREAD_MULTIPLE
No restrictions.
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Task & Threads
Code
Types
Initialization
Funneled
Serialized
Multi-Threaded
Overlap
Comm
Multi-threaded Mode
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Overlapping Communication and Work
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Overlapping Communication and Work
One core can saturate the PCI to network bus. Why use all of them to communicate?
Communicate with one or several cores.
Work with others during communication.
Need at least MPI_THREAD_FUNNELED support.
Can be difficult to manage and load balance!
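A hedged sketch of the overlap idea, assuming a single communication partner (names are placeholders):

#include <mpi.h>
#include <omp.h>

/* Needs at least MPI_THREAD_FUNNELED: thread 0 (the master) drives the
   communication while the other threads compute the part of the domain
   that does not depend on the incoming halo data. */
void overlap_step(double *halo_send, double *halo_recv, int n, int partner)
{
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            MPI_Sendrecv(halo_send, n, MPI_DOUBLE, partner, 0,
                         halo_recv, n, MPI_DOUBLE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            /* ... compute interior points that need no remote data ... */
        }
        #pragma omp barrier   /* both the work and the message are done */

        /* ... all threads now update the boundary using halo_recv ... */
    }
}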
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Overlapping Communication and Work
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Thread-rank Communication
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Thread-rank Communication
Can use thread and rank id in communication.
The usual technique in multi-threaded code is to use tags to distinguish threads.
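A hedged sketch of this tagging convention, assuming MPI_THREAD_MULTIPLE and per-thread blocks in the buffers (names are placeholders):

#include <mpi.h>
#include <omp.h>

/* Each thread talks to the thread with the same id on a partner rank.
   The thread id is used as the message tag, so messages from different
   threads cannot be confused.  sendbuf and recvbuf are assumed to hold
   one block of n doubles per thread. */
void thread_to_thread(double *sendbuf, double *recvbuf, int n, int partner)
{
    #pragma omp parallel
    {
        int tag = omp_get_thread_num();   /* one tag per thread */

        MPI_Sendrecv(sendbuf + tag * n, n, MPI_DOUBLE, partner, tag,
                     recvbuf + tag * n, n, MPI_DOUBLE, partner, tag,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}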
AdvancedComputing
M.A. Oliveira
HybridProgramming
DSM
Why
Modes
Overlap
Comm
Thread-rank Communication
AdvancedComputing
M.A. Oliveira
Development
Part XII
Introduction to Software Development Tools
AdvancedComputing
M.A. Oliveira
Development
Introduction to Software Development Tools
AdvancedComputing
M.A. Oliveira
Development
Introduction to Software Development Tools
Software Development Tools
[Figure: "Code Development" flowchart - questions such as "Does my code run?", "What is the compiler doing?", "Slow?", "Is it optimal?", "Is it really optimal?" and "Am I being naive?" lead to the tools: debugger, compiler reports & listings, timers, compiler optimizations, libraries, profilers and hardware performance counters, ending in success.]
AdvancedComputing
M.A. Oliveira
Libraries
Part XIII
Libraries
AdvancedComputing
M.A. Oliveira
Libraries
Why ParallelLibraries?
PerformanceLibraries
OptimizedLibraries
Common HPCLibraries
Libraries
AdvancedComputing
M.A. Oliveira
Libraries
Why ParallelLibraries?
PerformanceLibraries
OptimizedLibraries
Common HPCLibraries
Why Parallel Libraries?
Like most programming tasks, very little “real” software is created by starting from a blank slate and coding every line of every algorithm.
Large scale parallel software construction involves significant code reuse, making use of libraries that encapsulate much of what we learned.
AdvancedComputing
M.A. Oliveira
Libraries
Why ParallelLibraries?
PerformanceLibraries
OptimizedLibraries
Common HPCLibraries
Performance Libraries
Optimized for specific architectures.
Use library routines instead of hand-coding your own.
Offered by different vendors:
ESSL/PESSL on IBM systems.
Intel MKL for IA32, EM64T and IA64.
Cray libsci for Cray systems.
SCSL for SGI.
ACML for AMD.
AdvancedComputing
M.A. Oliveira
Libraries
Why ParallelLibraries?
PerformanceLibraries
OptimizedLibraries
Common HPCLibraries
Optimized Libraries
Use optimized libraries:
In “hot spots”, never write library functions by hand.
Numerical Recipes books DO NOT provide optimized code.
(Libraries can be 100x faster.)
AdvancedComputing
M.A. Oliveira
Libraries
Why ParallelLibraries?
PerformanceLibraries
OptimizedLibraries
Common HPCLibraries
Common HPC Libraries
SPRNG - Parallel Random Numbers
FFTW - Parallel FFT (MPI, OpenMP)
ScaLAPACK - Parallel Linear Algebra (MPI)
GOTO, ATLAS, Intel Math Kernel Library (MKL) - Parallel Linear Algebra (OpenMP)
PETSc - Parallel PDEs and related problems (MPI)
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Part XIV
Compiler Optimization
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Compiler Optimization
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Compiler Optimizations
The compiler now does a very good job of optimizing code so you don't have to.
But program developers should ensure that their codes are adaptable to hardware evolution and are scalable.
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Optimization Level: -On
-O0 no optimization: fast compilation, disables optimization.
-O1 optimization for speed, keeps code size small.
-O2 low to moderate optimization: partial debugging support, disables inlining.
-O3 aggressive optimization: compile time/space intensive and/or marginal effectiveness; may change code semantics and results (sometimes even breaks codes!)
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Optimization Levels
Operations performed at moderate optimization levels:
instruction rescheduling
copy propagation
software pipelining
common subexpression elimination
prefetching, loop transformations
Operations performed at aggressive optimization levels:
enables -O3
more aggressive prefetching, loop transformations
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Intel Compiler Options
Processor-specific optimization options
-xT generates specialized code for EM64T, includes SSE4
Other optimization options:
-mp maintain floating point precision (disables some optimizations).
-mp1 improve floating-point precision (speed impact is less than -mp).
-ip enable single-file interprocedural (IP) optimizations (within files).
-ipo enable multi-file IP optimizations (between files).
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Intel Compiler Options
Other options:
-g debugging information, generates symbol table.
-strict ansi strict ANSI compliance.
-C enable extensive runtime error checking.
-convert kwd specify file format keyword: big_endian, cray, ibm, little_endian, native, vaxd
-openmp enable the parallelizer to generate multi-threaded code based on the OpenMP directives.
-static create a static executable for serial applications.
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Intel Compiler - Best Practice
Normal compiling: ifort -O3 -xT test.f90
Try compiling at -O3 -xT.
If the code breaks or gives wrong answers with -O3 -xT, first try -mp (maintain precision).
-O2 is the default optimization; compile with -O0 if this breaks (very rare).
-xT can include optimizations and may break some codes.
Don't include debug options for a production compile!
ifort -O2 -g test.f90
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
PGI Compilers
-O3 performs some compile-time and memory intensive optimizations in addition to those executed with -O2, but may not improve performance for all programs.
-Mipa=fast interprocedural optimizations. There is a loader problem with this option.
-tp barcelona-64 includes specialized code for the Barcelona chip.
-fast
-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
-mp enable the parallelizer to generate multi-threaded code based on the OpenMP directives.
-Minfo=mp,ipa information about OpenMP and interprocedural optimization.
-help lists options.
AdvancedComputing
M.A. Oliveira
CompilerOptimization
Introduction
OptimizationLevels
Intel CompilerOptions
PGI CompilerOptions
Conclusions
Tuning Parameters
Different for each compiler.
Differences even between compiler versions.
Make sure you test and use them; you may be missing out on a well-deserved performance boost. But proceed with caution.
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Part XV
Debugging and Profiling
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
StandardDebuggers
DebuggingBasics
CommercialDebuggers
AnalysesTools
Debugging and Profiling
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
StandardDebuggers
DebuggingBasics
CommercialDebuggers
AnalysesTools
Standard Debuggers
The standard command line debugging tool in Linux is gdb. You can use this debugger for programs written in C, C++ and Fortran.
There are graphical frontends for gdb.
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
StandardDebuggers
DebuggingBasics
CommercialDebuggers
AnalysesTools
Debugging Basics
For effective debugging a couple of commands need to be mastered:
set breakpoints: regular and conditional
display the value of variables
set new values
step through a program
Less used commands can be learned as they become necessary.
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
StandardDebuggers
DebuggingBasics
CommercialDebuggers
AnalysesTools
gdb Basics
Common commands for gdb:
run - starts the program; if you do not set up any breakpoints the program will run until it terminates or core dumps.
print - prints a variable located in the current scope.
next - executes the current command, and moves to thenext command in the program.
step - steps through the next command.
break - sets a break point.
continue - used to continue till the next breakpoint or termination.
Note: if you are at a function call and you issue next, then the function will execute and return. However, if you issue step, then you will go to the first line of that function.
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
StandardDebuggers
DebuggingBasics
CommercialDebuggers
AnalysesTools
gdb Basics
More commands for gdb:
list - show code listing near the current execution location
delete - delete a breakpoint
condition - make a breakpoint conditional
display - continuously display value
undisplay - remove displayed value
where - show current function stack trace
help - display help text
quit - exit gdb
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
StandardDebuggers
DebuggingBasics
CommercialDebuggers
AnalysesTools
Commercial Debuggers: DDT & Totalview
Interactive, parallel, symbolic debuggers with GUI interface.
Works with C, C++ and Fortran Compilers
Available for many different platforms.
Supports OpenMP & MPI (and hybrid paradigm)
Supports 32- and 64-bit architectures
Simple to use (intuitive)
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Analyses Tools
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Performance Analysis Goals
Identify hotspot candidates for further study and optimization potential
Test optimization changes to verify usefulness
Floating-point improvements aimed at reducing overall wall-clock run time (but may potentially reduce scalability)
MPI improvements aimed at reducing MPI idle time and improving scalability
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Compiler Reports & Listings
Compilers will optionally generate optimization reports & listing files.
Use the loader map to determine what libraries you have loaded.
To activate them:
-Minfo=time,loop,inline,sym... (PGI)
-opt-report (optimization, Intel)
-S (listing, Intel)
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Timers
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Profilers
Profilers determine:
Call Tree
Wall clock/CPU time spent on each function
HW counters (e.g., cache misses, FLOPs)
A common Unix profiling tool is gprof.
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Profilers
AdvancedComputing
M.A. Oliveira
Debuggingand Profiling
AnalysesTools
Goals
CompilerReports &Listings
Timers
Profilers
HardwarePerformanceCounters
Hardware Performance Counters
Information obtained with profilers is usually extremely helpful but may not be enough to completely optimize a code.
Hardware performance counters usually require instrumenting the code but give a much more detailed view.
There are several solutions:
PAPI, TAU, PDT
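As a hedged illustration of counter instrumentation with PAPI's low-level API (a minimal sketch; the measured kernel is a placeholder and error checking is omitted):

#include <papi.h>
#include <stdio.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[2];

    /* initialise the library and build an event set */
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles         */
    PAPI_add_event(eventset, PAPI_L1_DCM);    /* L1 data cache misses */

    PAPI_start(eventset);
    /* ... kernel to be measured goes here ... */
    PAPI_stop(eventset, counts);

    printf("cycles = %lld, L1 data misses = %lld\n", counts[0], counts[1]);
    return 0;
}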
AdvancedComputing
M.A. Oliveira
HPCAvailableResources
Part XVI
HPC Available Resources
AdvancedComputing
M.A. Oliveira
HPCAvailableResources
HPC AvailableResources:Portugal
HPC AvailableResources:Europe
HPC Available Resources
AdvancedComputing
M.A. Oliveira
HPCAvailableResources
HPC AvailableResources:Portugal
HPC AvailableResources:Europe
HPC Available Resources: Portugal
MILIPEIA - http://www.lca.uc.pt
INGRID - http://www.gridcomputing.pt
RNCA - http://www.rnca.org.pt
AdvancedComputing
M.A. Oliveira
HPCAvailableResources
HPC AvailableResources:Portugal
HPC AvailableResources:Europe
HPC Available Resources: Europe
HPCEuropa - http://www.hpc-europa.org
PRACE - http://www.prace-project.eu