Transcript of "Topics in Advanced Computing" by Miguel Afonso Oliveira (LIP - Laboratório de Instrumentação e Física Experimental de Partículas / Instituto de Telecomunicações - Pólo de Coimbra).

Page 1:

Advanced Computing

M.A. Oliveira

Topics in Advanced Computing

Miguel Afonso Oliveira

Laboratório de Instrumentação e Física Experimental de Partículas - LIP

Instituto de Telecomunicações - Pólo de Coimbra

Page 2:

Outline

Part I: Computing
Part II: Platforms
Part III: Design
Part IV: Efficiency
Part V: Performance
Part VI: OpenMP
Part VII: More OpenMP
Part VIII: Intro to MPI
Part IX: MPI
Part X: More MPI
Part XI: Hybrid
Part XII: Development
Part XIII: Libraries
Part XIV: Compilers
Part XV: Debug & Prof
Part XVI: Resources

Page 3:

Part I: Introduction to Parallel Computing

1 Science and Computers: Classical Scientific Method; Modern Scientific Method; The Grand Challenges in Science

2 Motivating Parallelism: Computational Power Argument; Memory/Disk Speed Argument; The Data Communication Argument

3 Parallel Computing: What is Parallel Computing?; Why do Parallel Computing?

4 Message

Page 4:

Part II: Parallel Programming Platforms

5 Parallel Computers: Flynn's Taxonomy: Early days; Flynn's Taxonomy: Recently; Memory Model Taxonomy

6 Parallel Programming: The Two Extreme Models; Parallel Programming: The Real World

Page 5:

Part III: Principles of Parallel Algorithm Design

7 The Task/Channel Model

8 Foster’s Design Methodology

9 Message

Page 6:

Part IV: Computer Architecture and Efficiency

10 Introduction

11 Memory Hierarchy: Introduction; Caches; Virtual Memory

12 Designing for memory hierarchies: Introduction; Temporal locality (Inline small functions; if-then-else; compact loops); Spatial locality (Loop index order; Blocking; Block reordering; Dynamic data structures)

13 Loop unrolling

14 Library usage

Page 7:

Part V: Performance Analysis

15 Introduction

16 Speedup and Efficiency

17 Amdahl’s law

18 Gustafson-Barsis’ Law

19 Summary on speedup limits

20 The Karp-Flatt metric

Page 8:

Part VI: OpenMP

21 Introduction: Definition; Advantages and Disadvantages of OpenMP; Architecture; Components; Execution Model; Thread Communication

22 OpenMP: Syntax and Sentinels; Parallel Control Directives; Combined Directives; Data Environment; Directive Clauses

Page 9:

Part VII: More OpenMP

23 Synchronisation

24 Runtime libraries

25 Environment variables

26 Compiling and Running an OpenMP programme: Compiling; Running

Page 10:

Part VIII: Intro to MPI

27 MPI

28 Definition

29 Advantages and Disadvantages

30 Fundamentals

31 Basic MPI

32 Example

33 Compiling and Running an MPI program: Compiling; Running

Page 11:

Part IX: MPI

34 Collective Communication: Synchronization; Data Movement; Advanced Data Movement Primitives; Data movement with computation

35 Point to Point Communication: Deadlock; Unidirectional point to point communications; Bidirectional point to point communications; Avoid Deadlock

Page 12:

Part X: More MPI

36 Derived Datatypes: Basic Types; Derived Datatypes

37 Communicators and Groups

38 Topologies

39 Wildcards

40 Timing

41 MPI I/O: Introduction; I/O types in MPI programs; Parallel MPI I/O to a single file

Page 13:

Part XI: Hybrid programming

42 Hybrid Programming

43 Distributed and Shared Memory Systems

44 Why Hybrid Computing

45 Modes of Hybrid Computing: Tasks and Threads; Hybrid Coding; Types; MPI Initialization; Funneled Mode; Serialized Mode; Multi-Threaded Mode

46 Overlapping Communication and Work

47 Thread-rank Communication

Page 14:

Part XII: Introduction to Software Development Tools

48 Introduction to Software Development Tools

Page 15:

Part XIII: Libraries

49 Libraries: Why Parallel Libraries?; Performance Libraries; Optimized Libraries; Common HPC Libraries

Page 16:

Part XIV: Compiler Optimization

50 Compiler Optimization: Introduction; Optimization Levels; Intel Compiler Options; PGI Compiler Options; Conclusions

Page 17:

Part XV: Debugging and Profiling

51 Debugging and Profiling: Standard Debuggers; Debugging Basics; Commercial Debuggers

52 Analysis Tools: Goals; Compiler Reports & Listings; Timers; Profilers; Hardware Performance Counters

Page 18:

Part XVI: HPC Available Resources

53 HPC Available Resources: Portugal; Europe

Page 19:

Part I

Introduction to Parallel Computing

Page 20:

Science and Computers

Page 21:

Classical Science

[Figure: the classical scientific method as a cycle between Nature, Observation, Theory and Experimentation.]

Page 22:

Modern Science

[Figure: the modern scientific method adds Numerical Simulation to the cycle of Nature, Observation, Theory and Experimentation.]

Page 23:

The Grand Challenges in Science

Quantum Chemistry, Statistical Mechanics and Relativistic Physics

Cosmology and Astrophysics

Computational Fluid Dynamics and Turbulence

Material Design and Superconductivity

Biology, Pharmacology, Genome Sequencing, Genetic Engineering, Protein Folding, Enzyme Activity and Cell Modeling

Medicine and Modeling of Human Organs and Bones

Global Weather and Environmental Modeling

Page 24:

Motivating Parallelism

Page 25:

Computational Power Argument

Moore’s Law: Circuit complexity doubles every eighteen months.

Generalized Moore’s Law: The amount of computing power available at a given cost doubles every eighteen months.

Critical Issue:

How do we translate transistors into useful operations per second? How do we use the transistors to achieve increasing rates of computation?

Logical Answer:

Rely on parallelism (implicit and explicit).

Page 26:

Memory/Disk Speed Argument

The gap between processor speed and memory/disk speed presents a tremendous performance bottleneck.

The usual compromise is to use a hierarchy of caches...

Parallelism can also help because it yields better memory/disk system management:

larger aggregate caches;
higher aggregate bandwidth.

Page 27:

The Data Communication Argument

Networking infrastructure has evolved tremendously.

Why not use it as a heterogeneous parallel/distributed computing environment?

Voluntary Computing

SETI@HOME

IBERCIVIS...

Grid Computing

WLCG...

Page 28:

Parallel Computing

Page 29:

What is Parallel Computing?

Parallel Computing: the use of multiple processing units or computers for a common task.

Each processing unit works on its section of the problem.

Processing units can exchange information.

[Figure: the problem domain divided into four areas, with PU_1, PU_2, PU_3 and PU_4 each working on its own area of the problem.]

Page 30:

Why do Parallel Computing?

To compute beyond the limits of single PU systems:

achieve more performance;
utilize more memory.

To be able to:

solve problems that can’t be solved in a reasonable time with single PU systems;
solve problems that don’t fit on a single PU system or even a single system.

So we can:

solve larger problems;
solve problems faster;
solve more problems.

Page 31:

Message

Page 32:

Message

If you can’t compute you can’t compete!

Page 33:

Part II

Parallel Programming Platforms

Page 34:

Parallel Computers

Page 35:

Early days: Flynn’s Taxonomy

Page 36:

Recently: Flynn’s Taxonomy

[Figure: MIMD divides into SPMD (Single Program Multiple Data) and MPMD (Multiple Program Multiple Data).]

Page 37:

Memory Model Taxonomy

[Figure: Parallel Computers split by memory model. Shared Memory: CPU_0, CPU_1, ..., CPU_N connected through an interconnect to a single Memory. Distributed Memory: CPU_0, CPU_1, ..., CPU_N, each with its own local memory (Mem_0, Mem_1, ..., Mem_N), connected by an interconnect.]

Page 38:

Parallel Programming

Page 39:

The Two Extreme Parallel Programming Models

[Figure: the two extreme models. Shared Memory systems map to Shared Memory Programming (OpenMP); Distributed Memory systems map to Message-Passing Programming (MPI).]

The distributed model can be used directly on a shared memory system.

Using the shared memory model on a distributed memory system is only possible indirectly.

Both models can be combined to optimize performance.

Page 40:

Parallel Computers and Parallel Programming: The Real World

[Figure: the real-world landscape, from hardware up through operating system to library/language. Homogeneous systems: Shared Memory; Distributed Global Address Space (DGAS); Partitioned Global Address Space (PGAS); Distributed Memory clusters over Myrinet or Infiniband, with RDMA. Heterogeneous systems: CPU+GPU, CPU+Coprocessor, CPU+FPGA. Software: threads, OpenMP, MPI, PVM, Intel tools, CAF, HPF, UPC, Chapel, IBM X10, SUN Fortress, CUDA, OpenCL.]

Page 41:

Part III

Principles of Parallel Algorithm Design

Page 42:

The Task/Channel Model

Page 43:

The Task/Channel Model

Parallel Computation: a set of tasks interacting with each other by sending messages through channels.

Task: a program, its local memory, and a collection of I/O ports.

Communication: tasks can send local data values to other tasks via output ports and receive data values from other tasks via input ports.

Channel: a message queue connecting one task’s output port to another task’s input port.

Synchronicity: message reception is synchronous (the receiver must wait for the message to arrive - blocking). Sending is asynchronous (the sender never blocks).

Page 44:

The Task/Channel Model

Page 45:

Foster’s Design Methodology

Page 46:

Foster’s Design Methodology

A four-step process for designing parallel algorithms:

Partitioning: the process of dividing the computation and the data into primitive tasks. Can be done in a data-centric approach (domain decomposition) or a computation-centric approach (functional decomposition).

Communication: determining the communication pattern between the primitive tasks and identifying local and global communication.

Agglomeration: the process of grouping tasks into larger tasks in order to improve performance, simplify programming and lower communication overhead.

Mapping: the process of assigning tasks to processors.

Page 47:

Foster’s Design Methodology

Page 48:

Message

Page 49:

Message

Most programming problems have several parallel solutions.

The best solution may differ from that suggested by existing sequential algorithms.

You should not parallelize a code (most times you end up paralyzing it!). The correct approach is to develop the parallel algorithm.

Page 50:

Part IV

Computer Architecture and Efficiency

Page 51:

Introduction

Page 52:

Introduction

In order to produce “fast” code one needs to understand modern computer hardware.

Crucial Concept: the Memory Hierarchy.

Advice: keep simple, intelligible (unoptimized) code in comments for future reference.

Page 53:

Memory Hierarchy

Page 54:

Memory Hierarchy

Page 55:

Caches

The fastest memory is the registers. Operations act directly on their values, and they can be accessed in one clock cycle.

The cache is inside the CPU, so it is limited in size and fixed.

Caches are broken up into cache lines: short blocks of memory that copy memory ranges from main memory.

Blocks of memory in the cache are indexed by a table in hardware.

Memory accesses from the CPU can be cache hits or cache misses.

It is good programming practice to use memory sequentially, to be able to reuse cache lines and avoid cache misses.

Page 56:

Virtual Memory

Virtual memory is the use of storage devices to increase the amount of available memory.

Blocks of virtual memory are called pages.

When memory is accessed, a table is looked at to find its physical address. If that address is not in main memory, a page is read from storage and the table is updated; this is called a page fault.

Generating many page faults is called thrashing.

It is good programming practice to use memory sequentially, to be able to reuse pages and avoid page faults.

Page 57:

Designing for memory hierarchies


Introduction

Main principle: Locality

Temporal locality - executable code
Spatial locality - data

For good temporal locality, code should avoid jumping around too much. Most times a jump or a branch to a far-away address occurs, cache lines need to be filled with new instructions.

For good spatial locality, code should use memory sequentially.


Inline small functions

Consider the codes:

double x[N], y[N], z[N];
...
for( i=0 ; i<N ; ++i )
    z[i] = x[i] + y[i];

versus

double d_add(double a, double b) { return a+b; }
...
for( i=0 ; i<N ; ++i )
    z[i] = d_add(x[i], y[i]);

Function calls involve setting up, and later destroying, a stack frame.

Function calls usually involve branching to somewhere far away in memory.

Variables usually have to be saved from registers to memory and vice-versa.

If you really need code like this, consider inlining small functions.
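As a sketch of one way to do this: in C99 the helper can be declared `static inline`, which invites the compiler to substitute the body at each call site, avoiding the stack-frame and branching costs listed above while keeping the function abstraction (the names here are illustrative):

```c
/* Candidate for inlining: the compiler can replace each call with a+b. */
static inline double d_add(double a, double b) { return a + b; }

void add_arrays(const double *x, const double *y, double *z, int n) {
    for (int i = 0; i < n; ++i)
        z[i] = d_add(x[i], y[i]);   /* typically compiles to z[i] = x[i] + y[i] */
}
```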


if-then-else

if-then-else constructs tend to reduce temporal locality. Avoid them if possible in crucial sections of the code, such as inner loops, as they reduce locality and make it much more difficult for compilers to optimize the code.
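A common remedy, sketched below with illustrative function names: when the branch condition is loop-invariant, hoist the test out of the loop ("loop unswitching") so each version becomes a straight-line loop the compiler can optimize and vectorize:

```c
/* Before: the branch is evaluated on every iteration. */
void scale_branchy(double *a, int n, int doubling) {
    for (int i = 0; i < n; ++i) {
        if (doubling) a[i] *= 2.0;
        else          a[i] *= 0.5;
    }
}

/* After: one test, then two branch-free loops. */
void scale_hoisted(double *a, int n, int doubling) {
    if (doubling)
        for (int i = 0; i < n; ++i) a[i] *= 2.0;
    else
        for (int i = 0; i < n; ++i) a[i] *= 0.5;
}
```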


Compact loops

Keep your loops as compact as possible to achieve better locality and improve the chance of compiler optimizations.


Loop index order

Always order your loops according to the row-wise or column-wise storage order of your language.

C/C++ (row-wise ordering):

for( i=0 ; i<m ; ++i )
    for( j=0 ; j<n ; ++j )
        y[i] = y[i] + a[i][j]*x[j];

Fortran (column-wise ordering):

do j=1,n
    do i=1,m
        y(i) = y(i) + a(i,j)*x(j)
    enddo
enddo


Blocking

In loops with multiple arrays it is often impossible to achieve good spatial locality just by loop index reordering. This is the case if both rows and columns are used simultaneously. A common practice is to operate on blocks of the matrices (sub-matrices). The goal is to maximize access to the data loaded into cache before it gets replaced. This technique is called blocking.
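The idea can be sketched for C = C + A·B on n×n row-major matrices (a minimal illustration, not a tuned kernel; NB is an assumed tile size chosen so three NB×NB tiles fit in cache together):

```c
#define NB 32   /* assumed tile size: tune per cache size */

void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += NB)
      for (int kk = 0; kk < n; kk += NB)
        for (int jj = 0; jj < n; jj += NB)
          /* multiply one tile pair while its data is cache-resident */
          for (int i = ii; i < ii + NB && i < n; ++i)
            for (int k = kk; k < kk + NB && k < n; ++k) {
                double aik = A[i*n + k];
                for (int j = jj; j < jj + NB && j < n; ++j)
                    C[i*n + j] += aik * B[k*n + j];
            }
}
```

Each tile of A, B, and C is revisited many times while it is still in cache, instead of streaming entire rows and columns through the cache once per use.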


Block reordering

Suppose we want to multiply two large n×n matrices and we use a block algorithm:

[ A11 A12 ] [ B11 B12 ]   [ A11·B11 + A12·B21   A11·B12 + A12·B22 ]
[ A21 A22 ] [ B21 B22 ] = [ A21·B11 + A22·B21   A21·B12 + A22·B22 ]

This algorithm can be applied recursively such that every block fits in the available cache. In this case, instead of storing the matrices in either row-wise or column-wise order, we can store them in a way that makes the algorithm particularly convenient: to store A we first store A11, then A12, followed by A21 and A22. Each sub-matrix should be recursively stored in the same fashion.


For an 8×8 matrix we would have the following storage order:

 1  2  5  6 17 18 21 22
 3  4  7  8 19 20 23 24
 9 10 13 14 25 26 29 30
11 12 15 16 27 28 31 32
33 34 37 38 49 50 53 54
35 36 39 40 51 52 55 56
41 42 45 46 57 58 61 62
43 44 47 48 59 60 63 64

Such a layout produces a cache-oblivious algorithm.


Cache-oblivious algorithms have, however, several issues that need to be taken into account:

Computing the address of a specific entry is much more complex than for standard matrix layouts.

The natural algorithms are recursive, which is difficult for compilers to optimize, and recursive routines cannot be inlined.

The natural base cases for the recursion are 1×1 or 2×2 block matrices, which are too small for modern pipelined CPUs.


Dynamic data structures

Dynamic data structures, like lists and trees, are commonly used in advanced algorithms.

Creation and destruction of nodes may destroy memory locality.

How can we create dynamic data structures and still keep memory locality?

Copy the data structure after creation to force a sequential node layout before any critical calculation.
Use allocation/deallocation routines that force memory locality.
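The second option can be sketched as a pool (arena) allocator: nodes are carved from one contiguous array, so nodes created in sequence are adjacent in memory. This is a minimal illustration with assumed names, not a complete allocator (it does not grow or free individual nodes):

```c
#include <stdlib.h>

struct node  { double value; struct node *next; };
struct arena { struct node *slots; size_t used, capacity; };

/* Hand out nodes in address order from a contiguous block. */
struct node *arena_new_node(struct arena *a) {
    if (a->used == a->capacity)
        return NULL;            /* a real implementation would grow here */
    return &a->slots[a->used++];
}
```

A list built front-to-back through such an arena lays successive nodes on the same cache lines, so a later traversal enjoys much of the spatial locality of an array.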


Loop unrolling


Consider the codes:

sum = 0.0;
for( i=0 ; i<n ; ++i )
    sum = sum + a[i]*b[i];

versus the unrolled version:

sum0 = sum1 = sum2 = sum3 = 0.0;
for( i=0 ; i<4*(n/4) ; i+=4 )
{
    sum0 = sum0 + a[i]*b[i];
    sum1 = sum1 + a[i+1]*b[i+1];
    sum2 = sum2 + a[i+2]*b[i+2];
    sum3 = sum3 + a[i+3]*b[i+3];
}
for( i=4*(n/4) ; i<n ; ++i )
    sum0 = sum0 + a[i]*b[i];
sum = sum0 + sum1 + sum2 + sum3;

Most CPUs use pipelined units to perform these operations. In the first version each multiply-add has to clear the pipeline before it can be added to “sum”. In the second version the pipeline is kept full and more multiply-adds are in flight at once. This technique is called loop unrolling. How much loop unrolling is needed depends on the details of the architecture and the effectiveness of the compiler optimizations.


Library usage


In any algorithm performance hotspot NEVER, EVER use your own routines for numerical operations.

ALWAYS use libraries.

ALWAYS choose CPU-optimized libraries if available.

Typical preference order for linear algebra:

Manufacturer version
ATLAS or GOTO
operating system version
your own version


Part V

Performance Analysis


Introduction


Being able to accurately predict the performance of a parallel algorithm you have designed can help you decide whether to actually go to the trouble of coding and debugging it.

Being able to analyze the execution time exhibited by a parallel program can help you understand the barriers to higher performance and predict how much improvement can be realized by increasing the number of processors.


Speedup and Efficiency


Speedup = (sequential execution time) / (parallel execution time) = ψ(n, p)

Efficiency = Speedup / (number of processors) = ε(n, p)
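The two definitions translate directly into code; the timings below are made-up illustrative numbers:

```c
/* psi(n, p): sequential time over parallel time */
double speedup(double t_seq, double t_par) {
    return t_seq / t_par;
}

/* epsilon(n, p): speedup divided by the processor count */
double efficiency(double t_seq, double t_par, int p) {
    return speedup(t_seq, t_par) / p;
}
```

For example, a 100 s sequential run that takes 30 s on 4 processors gives a speedup of about 3.33 and an efficiency of about 0.83.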


Time in parallel algorithms is spent in:

Sequential execution: σ(n)
Parallel execution: ϕ(n)
Parallel overhead: κ(n, p)

Speedup and efficiency can hence be written as:

ψ(n, p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p + κ(n, p))

ε(n, p) ≤ (σ(n) + ϕ(n)) / (p·σ(n) + ϕ(n) + p·κ(n, p))


Amdahl’s law


Consider the expression for the speedup ψ(n, p). Since κ(n, p) ≥ 0 we have:

ψ(n, p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p + κ(n, p)) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p)

Let f denote the sequential portion of the computation:

f = σ(n) / (σ(n) + ϕ(n))

Substituting we obtain:

ψ(n, p) ≤ 1 / (f + (1 − f)/p)


Amdahl’s Law: Let f be the fraction of operations in a computation that must be performed sequentially. The maximum speedup ψ achievable by a parallel computer with p processors is:

ψ(n, p) ≤ 1 / (f + (1 − f)/p)

Corollary: The maximum achievable speedup by a parallel algorithm with a fraction f of its code performed sequentially is:

lim (p→+∞) ψ(n, p) ≤ 1/f
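The bound is easy to evaluate numerically; a small sketch:

```c
/* Amdahl bound: maximum speedup for serial fraction f on p processors. */
double amdahl_bound(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}
```

With f = 0.1, for example, 10 processors give at most about 5.3×, and no number of processors can exceed 1/f = 10×.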


Amdahl’s Effect

Typically κ(n, p) has lower complexity than ϕ(n). Increasing the size of the problem increases the computation time faster than it increases the parallel overhead. Hence, for a fixed number of processors, speedup is usually an increasing function of problem size. This is known as Amdahl’s effect.


Gustafson-Barsis’ Law


Consider the expression for the speedup ψ(n, p). Since κ(n, p) ≥ 0 we have:

ψ(n, p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p + κ(n, p)) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p)

Let s denote the fraction of time spent in the parallel execution performing sequential operations:

s = σ(n) / (σ(n) + ϕ(n)/p)

Substituting we obtain:

ψ(n, p) ≤ p + (1 − p)·s


Gustafson-Barsis’ Law: Given a parallel program solving a problem of size n using p processors, let s denote the fraction of total execution time spent in serial code. The maximum speedup (usually called scaled speedup) ψ achievable by this program is:

ψ(n, p) ≤ p + (1 − p)·s
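A small numerical sketch of the scaled-speedup bound:

```c
/* Gustafson-Barsis bound: s is the serial fraction measured in the
 * parallel run, p the processor count. */
double gustafson_bound(double s, int p) {
    return p + (1.0 - p) * s;
}
```

For example, with s = 0.05 on 64 processors the bound is 64 − 63·0.05 = 60.85.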


Summary on speedup limits


Amdahl’s law determines speedup by taking a sequential computation and predicting how quickly that computation could execute on multiple processors.

Gustafson-Barsis’ Law begins with a parallel computation and estimates how much faster the parallel computation is than the same computation executing on a single processor.

Both laws ignore parallel overhead.


The Karp-Flatt metric


We define the experimentally determined serial fraction e of a parallel computation by:

T(n, p) = T(n, 1)·e + (T(n, 1)/p)·(1 − e)

The Karp-Flatt metric: Given a parallel computation exhibiting speedup ψ on p processors, the experimentally determined serial fraction e is:

e = (1/ψ − 1/p) / (1 − 1/p)
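The metric is a one-line computation on measured data; a sketch:

```c
/* Experimentally determined serial fraction from a measured speedup psi
 * on p processors. */
double karp_flatt(double psi, int p) {
    return (1.0 / psi - 1.0 / p) / (1.0 - 1.0 / p);
}
```

For example, a measured speedup of ψ = 4 on p = 8 processors gives e = 1/7 ≈ 0.143; tracking e as p grows shows whether serial code or parallel overhead is limiting the speedup.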


The Karp-Flatt metric is useful because it:

Takes into account parallel overhead.

Helps detect sources of overhead and inefficiency.


Part VI

OpenMP


Introduction


Definition

OpenMP is a parallel programming model for sharedmemory and distributed shared memory multiprocessors.

OpenMP is not a computer language; it can be used from FORTRAN and C/C++.

OpenMP requires a compliant compiler.


Advantages and Disadvantages of OpenMP

Advantages:

Easier to learn.
Parallelization can be incremental.
Coarse-grain and fine-grain parallelization.
Widely available.
Portable.

Disadvantages:

Limited scalability.
Impossible to use directly on distributed memory systems.


Components

(Diagram: User — Environment Variables and Compiler Directives; Application — Runtime Libraries; Threads in the Operating System.)


Compiler Directives.

Run-time libraries.

Environment variables.


Execution Model: The fork/join model

Parallel regions are the building “blocks” within the code.

A master thread is started at runtime and persists throughout execution.

The master thread starts the team of threads at parallel regions.

(Figure: the master thread forks a team of threads at the start of a parallel region and joins them at its end.)


Thread Communication

Every thread has access to global memory - shared memory.

Every thread has access to stack memory - private memory.

Code should use shared memory to communicate between threads.

Simultaneous updates to shared memory can create a race condition: results change with different thread scheduling.

Use mutual exclusion to avoid unsafe data sharing, but don’t use too much of it, because it serializes execution.
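Both points can be sketched in one C example (illustrative names; compiled without OpenMP support the pragmas are simply ignored and the code runs serially):

```c
long counter = 0;   /* shared between all threads */

void count_hits(int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        /* Without the critical directive, two threads can read the same
         * old value of counter and one update is lost (a race). With it,
         * only one thread at a time executes the increment. */
        #pragma omp critical
        counter++;
    }
}
```

Note that wrapping every iteration's work in a critical section, as here, serializes the loop; real code keeps the mutually exclusive region as small and as rare as possible.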


Syntax and Sentinels


Parallel Control Directives

OpenMP provides two kinds of directives to control parallelism:

To create multiple threads.

To divide work among an existing set of parallel threads.


Combined Directives

Some directives can, however, be combined, such as the combined parallel work-sharing directives (parallel do / parallel for).


Data Environment

Only the master thread has a data address space that lasts for the entire duration of the run.

During the fork operation OpenMP chooses to either share a single copy of each variable between all the threads or provide each thread with its own private copy for the duration of the parallel construct.

There are defaults for choosing whether a variable is shared or private.

Users are advised to always specify explicitly whether each variable is shared or private.


Default behavior can be overridden with data scoping clauses:

FORTRAN:

!$OMP PARALLEL DEFAULT(NONE) &
!$OMP SHARED(...) PRIVATE(...) REDUCTION(...)

C/C++:

#pragma omp parallel default(none) \
    shared(...) private(...) reduction(...)

Initialization of private variables can also be controlled with firstprivate.

The value returned by private variables can also be controlled with lastprivate.


Data Environment: Reductions

An “operation” that combines multiple elements to form a single result is called a reduction operation.

The variable that accumulates the result is called the reduction variable.

Reduction operators and variables must be declared:

sum = 0
!$omp parallel do reduction(+ : sum)
do i = 1, n
   sum = sum + a(i)
enddo
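The same reduction in the C/C++ pragma syntax (a sketch; compiled without OpenMP support the pragma is ignored and the loop runs serially with the same result):

```c
double array_sum(const double *a, int n) {
    double sum = 0.0;
    /* each thread accumulates a private sum, combined at the join */
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```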


Directive Clauses

Clauses control the behavior of OpenMP directives.


Part VII

More OpenMP


Synchronisation


The term synchronization refers to the mechanisms by which a parallel program coordinates the execution of multiple threads. The two most common forms of synchronization are:

Mutual exclusion: a way to control access to a shared variable - the critical/atomic directives.

Event synchronization: a way to ensure that all threads reach a particular point in execution before continuing - the barrier directive.
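Event synchronization can be sketched with the implicit barrier at the end of a work-sharing loop (illustrative names; without OpenMP support the pragmas are ignored and the code runs serially):

```c
void two_phases(double *a, double *b, int n) {
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; ++i)
            a[i] = i;                 /* phase 1: produce */
        /* implicit barrier here: every element of a[] is written
         * before any thread starts phase 2 */
        #pragma omp for
        for (int i = 0; i < n; ++i)
            b[i] = a[n - 1 - i];      /* phase 2: consume */
    }
}
```

Without the barrier between the loops, a fast thread could read elements of a[] that a slower thread has not yet written.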


Runtime libraries


Page 114: Advanced Computing M.A. Oliveira Topics in Advanced …hpc/Presentations/Presentation_MAO.pdfAdvanced Computing M.A. Oliveira Outline Topics in Advanced Computing Miguel Afonso Oliveira

AdvancedComputing

M.A. Oliveira

Synchronisation

Runtimelibraries

Environmentvariables

Compilingand Runningan OpenMPprogramme

Environment variables


Compiling and Running an OpenMP programme


Compiling an OpenMP program

Compiling an OpenMP program depends on the compiler:

Intel Fortran: ifort -openmp input file

Pathscale Fortran: pathf90 -mp input file

GCC: gfortran -fopenmp input file (version 4.2 or later)


Running an OpenMP program

maolive@phorcys:~> pathf90 -mp hello3.f90 -o hello3

maolive@phorcys:~> export OMP_NUM_THREADS=5

maolive@phorcys:~> ./hello3

Hello world from thread 1 !

Hello world from thread 0 !

Hello world from thread 4 !

Hello world from thread 3 !

Hello world from thread 2 !


Part VIII

Introduction to MPI


MPI


Definition


MPI is a parallel programming model for message-passing multiprocessing.

It consists of a set of library calls that enable multiple processes to communicate.

In MPI, all instances of your parallel application are started at launch and have private address spaces.

MPI is not a computer language. It can be used from Fortran and C/C++.


Advantages and Disadvantages


Advantages

Scalability
Compatible even with shared memory systems
Widely available
Portable

Disadvantages

Harder to learn
Non-incremental parallelization


Fundamentals


MPI is based on the message passing paradigm, with a minimal interface such as:

send(address,count,datatype,destination,tag,comm)
receive(address,maxcount,datatype,source,tag,comm)

Messages can be either point-to-point or collective, and can be blocking or non-blocking.

Processes belong to a group. In each group a process is identified by its rank.

Each group of processes is characterized by an object called a communicator.


Basic MPI


Basic MPI Tasks

Initialization and Termination

Setting up Communicators

Point to Point Communication

Collective Communication


Basic Routines

MPI_Init Initialize MPI

MPI_Comm_size Find out how many processes there are

MPI_Comm_rank Find out which process I am

MPI_Send Send a message

MPI_Recv Receive a message

MPI_Finalize Terminate MPI

MPI_Bcast Broadcast a message

MPI_Reduce Reduce a message

MPI_Barrier Blocks until all processes reach the barrier

MPI_Abort Terminates MPI with an error code


Beyond Basic?

All other functions just add:

flexibility (datatypes),
robustness (non-blocking send/receive),
efficiency,
modularity (groups, communicators),
convenience (collective operations, topologies).


Example


Initialization and Termination

program param

include 'mpif.h'

call mpi_init(ierr)

call mpi_comm_size(MPI_COMM_WORLD,np,ierr)

call mpi_comm_rank(MPI_COMM_WORLD,mype,ierr)

...

do_something_useful

...

call mpi_finalize(ierr)

end program


Compiling and Running an MPI program


Compiling

Generally you use a special compiler or compiler wrapper script:

not defined by the standard

consult your implementation

handles the correct include path, library path, and libraries

MPICH-style (the most common):

C

mpicc -o foo foo.c

Fortran

mpif77 -o foo foo.f


Running

MPI programs require some help to get started:

what computers should I run on?

how do I access them?

MPICH-style:

mpirun -np 10 -machinefile mach ./foo

When batch schedulers are involved, all bets are off.


Part IX

MPI


Collective Communication


Collective Communications

Involves all processes within a communicator

There are three basic types:

Synchronization
Data movement
Data movement with computation

MPI Collectives are blocking

MPI Collectives do not use message tags


Synchronization: Barrier

A barrier is a synchronization primitive.
A node calling it will block until all the nodes within the group have called it.

MPI_BARRIER(COMM, IERR)


Data Movement: Broadcast

MPI provides the broadcast primitive to send data to all communicator members:
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERR)

It must be called by each node in the group with the same COMM and ROOT.


Data Movement: Scatter, Gather

Gather and scatter are inverse operations.
Gather collects data from every member of the group on the root node, in linear order by the rank of the node.
Scatter parcels out data from the root to every member of the group, in linear order by node.

MPI_GATHER(SNDBUF,SCOUNT,DATATYPE,RECVBUF,RCOUNT,RDATATYPE,ROOT,COMM,IERR)

MPI_SCATTER(SNDBUF,SCOUNT,DATATYPE,RECVBUF,RCOUNT,RDATATYPE,ROOT,COMM,IERR)
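The "linear order by rank" layout can be sketched serially in C, with a loop over ranks standing in for the processes (NPROCS, COUNT and the function names are illustrative choices, not MPI parameters):

```c
#include <string.h>

enum { NPROCS = 4, COUNT = 2 };  /* illustrative sizes */

/* Scatter semantics: the root parcels out COUNT elements to each
   rank, in linear order by rank. */
void scatter_sketch(const int *rootbuf, int rank, int *recvbuf) {
    memcpy(recvbuf, rootbuf + rank * COUNT, COUNT * sizeof(int));
}

/* Gather semantics: the root collects COUNT elements from each
   rank, in linear order by rank -- the inverse operation. */
void gather_sketch(const int *sendbuf, int rank, int *rootbuf) {
    memcpy(rootbuf + rank * COUNT, sendbuf, COUNT * sizeof(int));
}
```

Scattering a root buffer and then gathering the pieces back reproduces the original buffer, which is what "inverse operations" means here.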


Data Movement: All Gather

Provides a more efficient way to do a gather followed by a broadcast: all members of the group receive the collected data.

MPI_ALLGATHER(SNDBUF, SCOUNT, SDATATYPE, RECVBUF, RCOUNT, RDATATYPE, COMM, IERR)


Data Movement: All to All

The jth block sent from node i is received by node j and is placed in the ith block.
This is typically useful for implementing the Fast Fourier Transform.

MPI_ALLTOALL(SNDBUF, SCOUNT, SDATATYPE, RECVBUF, RCOUNT, RDATATYPE, COMM, IERR)
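Viewed serially, this exchange pattern is a transpose of the matrix of blocks; a C sketch with one element per block (NP and the function name are illustrative):

```c
enum { NP = 3 };  /* illustrative process count */

/* All-to-all semantics with one element per block: block j sent by
   rank i arrives as block i at rank j -- a transpose of the NP x NP
   block matrix. */
void alltoall_sketch(const int send[NP][NP], int recv[NP][NP]) {
    for (int i = 0; i < NP; i++)        /* sending rank */
        for (int j = 0; j < NP; j++)    /* receiving rank */
            recv[j][i] = send[i][j];
}
```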


Advanced Data Movement Primitives

MPI_{Scatterv,Gatherv,Allgatherv,Alltoallv}
The v stands for varying size and relative location of the messages.

Advantages:

flexibility
less need to copy data into temporary buffers
more compact

Disadvantages:

Harder to program - more bookkeeping


Generalized Data Movement Primitives

Scatter vs Scatterv


Data movement with computation: Reduce

A global reduction combines partial results from each node in the group using some basic function and delivers the answer to the root node:

MPI_REDUCE(SNDBUF, RECVBUF, COUNT, DATATYPE, OPERATOR, ROOT, COMM, IERR)


Data movement with computation: AllReduce

A variant of the reduce operation where the result is returned to all processes in the group.

MPI_ALLREDUCE(SNDBUF,RECVBUF,COUNT,DATATYPE,OPERATOR,COMM,IERR)


Data movement with computation: Scan

A scan performs partial reductions on distributed data.
Specifically, the scan operation returns the reduction of the values in the send buffers of the processes ranked 0, 1, ..., n into the receive buffer of the node ranked n.
MPI_SCAN(SNDBUF, RECVBUF, COUNT, DATATYPE, OPERATOR, COMM, IERR)
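For the MPI_SUM operator with one element per process, this is an inclusive prefix sum; a serial C sketch of the semantics (the rank loop stands in for the processes, names are illustrative):

```c
/* Scan-with-sum semantics, one element per rank: rank r receives the
   sum of the send values of ranks 0..r (an inclusive prefix sum). */
void scan_sum_sketch(const int *sendvals, int *recvvals, int nprocs) {
    int acc = 0;
    for (int r = 0; r < nprocs; r++) {
        acc += sendvals[r];      /* reduction of ranks 0..r so far */
        recvvals[r] = acc;       /* what rank r's receive buffer holds */
    }
}
```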


Data movement with computation: The operations

MPI_SUM sum
MPI_PROD product
MPI_MAX maximum value
MPI_MIN minimum value
MPI_MAXLOC max. value location
MPI_MINLOC min. value location
MPI_LAND Logical and
MPI_LOR Logical or
MPI_LXOR Logical xor
MPI_BAND Bitwise and
MPI_BOR Bitwise or
MPI_BXOR Bitwise xor

Table: Common predefined operations for MPI collectives


Point to Point Communication


Deadlock

Deadlock is a phenomenon most common with blocking communication. It occurs when all tasks are waiting for events that haven't been initiated yet.


Communication Behavior

Point to point communication functions have two distinct behaviors: blocking and non-blocking.

Blocking

Execution is suspended until the message buffer is safe to reuse.

Non-blocking

Execution is not suspended. The user is responsible for testing or waiting until the message buffer is safe to reuse.


Communication Modes

For each communication behavior there are four modes:

Standard

Implementation dependent. Should be the best compromise between reliability and performance.

Synchronous

The sending task sends the receiver a “ready to send” message. When the receiver replies with “ready to receive”, data is transferred.

Ready

When called, a “ready to receive” message has to have arrived, otherwise an error is generated.

Buffered

Copies the message to a user-supplied buffer and returns.


Unidirectional Point to point Communication

Mode Blocking Non-blocking

Standard MPI_Send MPI_Isend

Synchronous MPI_Ssend MPI_Issend

Ready MPI_Rsend MPI_Irsend

Buffered MPI_Bsend MPI_Ibsend

Receive MPI_Recv MPI_Irecv


Blocking Standard Mode


Blocking Synchronous Mode


Blocking Ready Mode


Blocking Buffered Mode


Non-Blocking Standard Mode


Point to point Communications

Ready mode has the least total overhead. However, the assumption is that the receive is already posted.

Synchronous mode is portable and “safe”. It does not depend on order (ready) or buffer space (buffered). However, it incurs substantial overhead.

Buffered mode decouples sender from receiver. There is no synchronization overhead on the sending task, and order of execution does not matter (ready). The user can control the size of the message buffers and the total amount of space. However, additional overhead may be incurred by the copy to the buffer, and buffer space can run out.

Standard mode is implementation dependent. Small messages are generally buffered (avoiding synchronization overhead) and large messages are usually sent synchronously (avoiding the required buffer space).


To block or not to block?!

This code worked on one machine but does not work in general. Why?

! SEND DATA

LM=6*NES+2

DO I=1,NUMPRC

NT=I-1

IF (NT.NE.MYPRC) THEN

print *,myprc,'send',msgtag,'to',nt
CALL MPI_SEND(NWS,LM,MPI_INTEGER,NT,MSGTAG, MPI_COMM_WORLD,IERR)

ENDIF

END DO

! RECEIVE DATA

LM=6*100+2

DO I=2,NUMPRC

CALL MPI_RECV(NWS,LM,MPI_INTEGER, &
MPI_ANY_SOURCE,MSGTAG,MPI_COMM_WORLD,IERR)

! do something with data

END DO


Bidirectional point to point communications

The MPI_SENDRECV function provides an efficient way to handle a common situation where a processor needs to send data to another processor and receive data from the same processor, or a different one. Calling MPI_SENDRECV is similar to calling MPI_SEND followed by a call to MPI_RECV.
MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status)

MPI_SENDRECV_REPLACE is a version of the previous primitive that allows the send buffer and receive buffer to be the same, which in effect replaces the contents of the send buffer with the received data.
MPI_SENDRECV_REPLACE(buf,count,datatype,dest,sendtag, source,recvtag,comm,status,ierr)


Avoiding Deadlock

different ordering of calls

non-blocking calls

bidirectional calls

buffered mode
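The effect of reordering calls can be seen in a toy model of two tasks doing blocking sends and receives, under the assumption of synchronous (zero-buffer) semantics where a send completes only against a matching receive on the peer (the model and names are illustrative, not MPI API):

```c
/* Toy schedule checker for two tasks under zero-buffer blocking
   semantics: at each step the two tasks' current operations must
   form a matching SEND/RECV pair, otherwise both block forever. */
enum op { SEND, RECV };

int schedule_completes(const enum op a[2], const enum op b[2]) {
    for (int step = 0; step < 2; step++)
        if (a[step] == b[step])  /* both send or both receive: deadlock */
            return 0;
    return 1;
}
```

Both tasks sending first deadlocks; swapping the order on one task (send/receive against receive/send) lets the schedule complete.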


Part X

More MPI


Derived Datatypes


Basic Types

MPI Datatype Fortran Datatype

MPI_BYTE

MPI_CHARACTER CHARACTER

MPI_COMPLEX COMPLEX

MPI_DOUBLE_PRECISION DOUBLE PRECISION

MPI_INTEGER INTEGER

MPI_LOGICAL LOGICAL

MPI_PACKED

MPI_REAL REAL

Table: Basic predefined datatypes for Fortran


Derived Datatypes

MPI basic datatypes are predefined for contiguous data of a single type.

What if an application has data of mixed types, or non-contiguous data?

existing solutions of multiple calls, or copying into a buffer and packing, etc., are slow, clumsy and waste memory
a better solution is creating/deriving datatypes for these special needs from existing datatypes

Derived datatypes can be created recursively and at run-time

Automatic packing and unpacking


Elementary: Language-defined types

Contiguous: Vector with stride of one

Vector: Separated by constant “stride”

Hvector: Vector, with stride in bytes

Indexed: Array of indices (for scatter/gather)

Hindexed: Indexed, with indices in bytes

Struct: General mixed types (for C structs etc.)


call MPI_Type_contiguous(count, datatype, newdatatype, ierr)

call MPI_Type_commit(newdatatype, ierr)

...

call MPI_Type_free(newdatatype,ierr)

call MPI_Type_vector(ncols, 1, nrows, MPI_DOUBLE_PRECISION, vtype, ierr)

call MPI_Type_commit(vtype, ierr)

...

call MPI_Send( A(nrows,1), 1, vtype ...)

...

call MPI_Type_free(vtype,ierr)
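The vector type above describes ncols blocks of one element, each a stride of nrows elements apart - that is, one row of a column-major (Fortran-order) matrix. A serial C sketch of the element pattern it selects (the function name is illustrative):

```c
/* Pick out row 'row' of a column-major nrows x ncols matrix 'a':
   ncols elements, each a stride of nrows apart -- the access
   pattern the vector type with (count=ncols, blocklength=1,
   stride=nrows) describes. */
void pack_row(const double *a, int nrows, int ncols, int row, double *out) {
    for (int j = 0; j < ncols; j++)
        out[j] = a[row + j * nrows];
}
```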


Communicators and Groups


All MPI communication is relative to a communicator, which contains a context and a group.
The group is just a set of processes.

[Figure: ranks in MPI_COMM_WORLD divided into two sub-communicators, COMM_1 and COMM_2, with processes renumbered within each group]


To handle communicators and groups we can usually deploy two strategies:

Group manipulation - more general.

Direct communicator manipulation - usually more compact and suitable for regular decompositions.


First strategy

MPI_Init(&argc,&argv);

world = MPI_COMM_WORLD;

MPI_Comm_size(world,&numprocs);

MPI_Comm_rank(world,&myid);

server=numprocs-1;

MPI_Comm_group(world,&world_group);

ranks[0] = server;

MPI_Group_excl(world_group,1,ranks,&worker_group);

MPI_Comm_create(world,worker_group,&workers);

MPI_Group_free(&worker_group);

MPI_Group_free(&world_group);

...

MPI_Comm_free(&workers);


Second strategy

color = ( myid == server);

MPI_Comm_split(world, color, 0, &workers);
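MPI_Comm_split groups ranks by color and renumbers each group from zero, ordered by key (here 0 for all, so old rank order is kept). A serial sketch of that renumbering rule (the function name is illustrative):

```c
/* New rank of process 'me' inside its color group: the number of
   lower-ranked processes sharing its color (all keys equal, as in
   the slide, so ties are broken by old rank). */
int split_rank_sketch(const int *colors, int me) {
    int r = 0;
    for (int i = 0; i < me; i++)
        if (colors[i] == colors[me])
            r++;
    return r;
}
```

With five processes and the last one as server, the colors are {0,0,0,0,1}: the server gets rank 0 in its own communicator while the workers keep ranks 0..3.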


Topologies


Another way to use communicators is to organize tasks on a given topology.

Cartesian topologies are predefined and allow laying out MPI tasks on a Cartesian coordinate grid.

[Figure: a 4 x 3 Cartesian grid of tasks with coordinates (0,0) through (3,2)]


Wildcards


Enables the programmer to avoid having to specify a tag and/or a source:

MPI_ANY_TAG
MPI_ANY_SOURCE

Enables the programmer to keep algorithms general: MPI_PROC_NULL

can be used in a send and/or receive
the operation completes immediately
no communication involved
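The wildcards can be combined in a receive loop that accepts messages in arrival order. A minimal sketch assuming an MPI environment; the tag values are illustrative:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, val;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* accept one message per worker, from any rank, with any tag */
        for (int i = 1; i < size; i++) {
            MPI_Recv(&val, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            /* the status object reveals who actually sent what */
            printf("got %d from rank %d (tag %d)\n",
                   val, status.MPI_SOURCE, status.MPI_TAG);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 100 + rank, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

MPI_PROC_NULL is useful in the same spirit: a boundary rank in a stencil code can "send" to MPI_PROC_NULL instead of special-casing the edge, and the call returns immediately with no communication.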


Timing


...
call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! synchronize all ranks before starting
t = MPI_Wtime()
...                                      ! timed region
call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! synchronize again before stopping
t = MPI_Wtime() - t
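The same pattern in C, with MPI_Wtick reporting the timer resolution. A minimal sketch assuming an MPI environment; the timed region is a placeholder:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Barrier(MPI_COMM_WORLD);       /* synchronize before starting the clock */
    double t = MPI_Wtime();

    /* ... timed region ... */

    MPI_Barrier(MPI_COMM_WORLD);       /* synchronize before stopping the clock */
    t = MPI_Wtime() - t;

    printf("elapsed: %f s (clock resolution %g s)\n", t, MPI_Wtick());

    MPI_Finalize();
    return 0;
}
```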


MPI I/O


Introduction

Programming languages have predefined functions to handle files.

Typical operations are open, close, read and write.

MPI supports counterparts to these operations.

These MPI routines can conveniently express parallelism.


I/O types in MPI programs

Non-parallel I/O

I/O to separate files

Parallel I/O


Parallel MPI I/O to a single file

#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    MPI_File thefile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* each rank fills its buffer with distinct values */
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &thefile);
    /* each rank views the file starting at its own byte offset */
    MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&thefile);

    MPI_Finalize();
    return 0;
}
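Reading the file back follows the same pattern with the access mode flipped. A minimal sketch assuming an MPI environment and that the "testfile" written above already exists:

```c
#include <mpi.h>
#include <stdio.h>
#define BUFSIZE 100

int main(int argc, char *argv[])
{
    int myrank, buf[BUFSIZE];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    /* same per-rank view as when writing, so each rank reads its own slice */
    MPI_File_set_view(fh, myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_read(fh, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    printf("rank %d read back first value %d\n", myrank, buf[0]);

    MPI_Finalize();
    return 0;
}
```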


Part XI

Hybrid Programming



Distributed and Shared Memory Systems


Combines distributed memory parallelization across nodes with on-node shared memory parallelization.

Largest systems now employ both architectures.


Why Hybrid Computing


Eliminates domain decomposition at the node level.

Automatic memory coherency within the node.

Lower memory latency and less data movement within the node.


Modes of Hybrid Computing


Modes of Hybrid Computing - Tasks and Threads

Fixing the execution to a particular processing unit and memory bank can speed up execution. Consider using "numactl"...

Running hybrid codes on a queuing system requires special configuration or skillful scripting.


Hybrid Coding


Modes of Hybrid Computing - Types

Support Level           Description

MPI_THREAD_SINGLE       Only one thread will execute.

MPI_THREAD_FUNNELED     The process may be multithreaded, but only the main thread can make MPI calls. Default mode.

MPI_THREAD_SERIALIZED   The process may be multithreaded, and any thread can make MPI calls, but threads cannot execute MPI calls concurrently.

MPI_THREAD_MULTIPLE     Multiple threads may call MPI. No restrictions.


MPI Initialization

Fortran: call MPI_Init_thread(required, provided, ierr)

C: int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
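The library reports the level it actually supports through the provided argument, which should be checked. A minimal sketch assuming an MPI environment:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;

    /* request FUNNELED; the library may grant more or less */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "insufficient thread support (got %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... hybrid MPI + OpenMP work ... */

    MPI_Finalize();
    return 0;
}
```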


Funneled Mode

MPI_THREAD_FUNNELED

Use OMP BARRIER, since there is no implicit barrier in the master work-share construct (OMP MASTER).

All other threads will be sleeping.
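The funneled pattern described above can be sketched as follows. This assumes an MPI environment with OpenMP; the function name funneled_exchange, the halo buffer, and the neighbor choice are illustrative:

```c
#include <mpi.h>
#include <omp.h>

#define N 1024
static double halo[N];   /* placeholder for real application data */

void funneled_exchange(int rank)
{
    #pragma omp parallel
    {
        /* ... threaded computation ... */

        #pragma omp barrier   /* all threads must finish before master talks */
        #pragma omp master
        {
            int left = (rank > 0) ? rank - 1 : MPI_PROC_NULL;
            /* only the master thread makes the MPI call */
            MPI_Sendrecv_replace(halo, N, MPI_DOUBLE, left, 0, left, 0,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        #pragma omp barrier   /* no implicit barrier after OMP MASTER */

        /* ... more threaded computation using the exchanged halo ... */
    }
}
```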



Serialized Mode

MPI_THREAD_SERIALIZED

Only an OMP BARRIER at the beginning is needed, since there is an implicit barrier in the SINGLE work-share construct (OMP SINGLE).

All other threads will be sleeping.
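The serialized pattern can be sketched analogously. This assumes an MPI environment with OpenMP; the function name serialized_bcast and the broadcast are illustrative:

```c
#include <mpi.h>
#include <omp.h>

void serialized_bcast(double *data, int n)
{
    #pragma omp parallel
    {
        /* ... threaded computation ... */

        #pragma omp barrier   /* only the barrier at the beginning is needed */
        #pragma omp single
        {
            /* any one thread may make the MPI call, but only one at a time */
            MPI_Bcast(data, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        }   /* OMP SINGLE carries an implicit barrier at its end */

        /* ... more threaded computation using the broadcast data ... */
    }
}
```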



Multi-Threaded Mode

MPI_THREAD_MULTIPLE

No restrictions.



Overlapping Communication and Work


One core can saturate the PCI-to-network bus, so why use all cores to communicate?

Communicate with one or several cores.

Work with the others during communication.

Needs at least MPI_THREAD_FUNNELED support.

Can be difficult to manage and load balance!



Thread-rank Communication


Both the thread id and the rank can be used in communication.

The usual multi-threading technique is to use tags to distinguish threads.



Part XII

Introduction to Software Development Tools



Introduction to Software Development Tools

Software Development Tools

Timers

Compiler Reports & Listings

Debuggers

Libraries

Compiler Optimizations

Profilers

Hardware Performance Counters

[Figure: "Code Development" cycle — Does my code run? What is the compiler doing? Time it! Slow? Is it optimal? Is it really optimal? Am I being naive? Success!]


Part XIII

Libraries



Why Parallel Libraries?

Like most programming tasks, very little "real" software is created by starting from a blank slate and coding every line of every algorithm.

Large-scale parallel software construction involves significant code reuse, making use of libraries that encapsulate much of what we have learned.


Performance Libraries

Optimized for specific architectures.

Use library routines instead of hand-coding your own.

Offered by different vendors:

ESSL/PESSL on IBM systems.
Intel MKL for IA32, EM64T and IA64.
Cray libsci for Cray systems.
SCSL for SGI.
ACML for AMD.


Optimized Libraries

Use optimized libraries:

In "hot spots", never write library functions by hand.
The Numerical Recipes books DO NOT provide optimized code (libraries can be 100x faster).


Common HPC Libraries

SPRNG - Parallel Random Numbers

FFTW - Parallel FFT (MPI, OpenMP)

ScaLAPACK - Parallel Linear Algebra (MPI)

GOTO, ATLAS, Intel Math Kernel Library (MKL) - Parallel Linear Algebra (OpenMP)

PETSc - Parallel PDEs and related problems (MPI)


Part XIV

Compiler Optimization



Compiler Optimizations

The compiler now does a very good job of optimizing code so you don't have to.

But program developers should ensure that their codes are adaptable to hardware evolution and are scalable.


Optimization Level: -On

-O0 no optimization: fast compilation, disables optimization.

-O1 optimization for speed, keeps code size small.

-O2 low to moderate optimization: partial debugging support, disables inlining.

-O3 aggressive optimization: compile time/space intensive and/or of marginal effectiveness; may change code semantics and results (sometimes even breaks codes!)


Optimization Levels

Operations performed at moderate optimization levels:

instruction rescheduling
copy propagation
software pipelining
common subexpression elimination
prefetching, loop transformations

Operations performed at aggressive optimization levels:

enables -O3
more aggressive prefetching, loop transformations


Intel Compiler Options

Processor-specific optimization options

-xT generates specialized code for EM64T, includes SSE4

Other optimization options:

-mp maintain floating-point precision (disables some optimizations).
-mp1 improve floating-point precision (speed impact is less than -mp).
-ip enable single-file interprocedural (IP) optimizations (within files).
-ipo enable multi-file IP optimizations (between files).


Intel Compiler Options

Other options:

-g debugging information, generates symbol table.

-strict_ansi strict ANSI compliance.

-C enable extensive runtime error checking.

-convert kwd specify file format keyword: big_endian, cray, ibm, little_endian, native, vaxd.

-openmp enable the parallelizer to generate multi-threaded code based on the OpenMP directives.

-static create a static executable for serial applications.


Intel Compiler - Best Practice

Normal compile: ifort -O3 -xT test.c

Try compiling at -O3 -xT.

If the code breaks or gives wrong answers with -O3 -xT, first try -mp (maintain precision).

-O2 is the default optimization level; compile with -O0 if this breaks (very rare).

-xT can include optimizations that may break some codes.

Don't include debug options in a production compile, i.e. avoid:

ifort -O2 -g test.c


PGI Compilers

-O3 performs some compile-time and memory-intensive optimizations in addition to those executed with -O2, but may not improve performance for all programs.

-Mipa=fast interprocedural optimizations. There is a loader problem with this option.

-tp barcelona-64 includes specialized code for the Barcelona chip.

-fast equals:

-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz

-mp enable the parallelizer to generate multi-threaded code based on the OpenMP directives.

-Minfo=mp,ipa information about OpenMP and interprocedural optimization.
-help lists options.


Tuning Parameters

Different for each compiler.

Differences even between compiler versions.

Make sure you test and use them; you may be missing out on a well-deserved performance boost. But proceed with caution.


Part XV

Debugging and Profiling



Standard Debuggers

The standard command-line debugging tool in Linux is gdb. You can use this debugger for programs written in C, C++ and Fortran.

There are graphical frontends for gdb.


Debugging Basics

For effective debugging, a couple of commands need to be mastered:

set breakpoints: regular and conditional
display the value of variables
set new values
step through a program

Less used commands can be learned as they become necessary.


gdb Basics

Common commands for gdb:

run - starts the program; if you do not set any breakpoints the program will run until it terminates or core dumps.

print - prints a variable located in the current scope.

next - executes the current command and moves to the next command in the program.

step - steps into the next command.

break - sets a breakpoint.

continue - continues until the next breakpoint or termination.

Note: if you are at a function call and you issue next, the function will execute and return. However, if you issue step, you will go to the first line of that function.


More commands for gdb:

list - show code listing near the current execution location

delete - delete a breakpoint

condition - make a breakpoint conditional

display - continuously display value

undisplay - remove displayed value

where - show current function stack trace

help - display help text

quit - exit gdb


Commercial Debuggers: DDT & Totalview

Interactive, parallel, symbolic debuggers with a GUI interface.

Work with C, C++ and Fortran compilers.

Available for many different platforms.

Support OpenMP & MPI (and the hybrid paradigm).

Support 32- and 64-bit architectures.

Simple to use (intuitive).


Analysis Tools


Performance Analysis Goals

Identify hotspot candidates for further study and optimization potential.

Test optimization changes to verify their usefulness.

Floating-point improvements aim at reducing overall wall-clock run time (but may potentially reduce scalability).

MPI improvements aim at reducing MPI idle time and improving scalability.


Compiler Reports & Listings

Compilers will optionally generate optimization reports & listing files.

Use the loader map to determine what libraries you have loaded.

To activate them:

-Minfo=time,loop,inline,sym,... (PGI)
-opt-report (optimization, Intel)
-S (listing, Intel)


Timers


Profilers

Profilers determine:

Call Tree

Wall clock/CPU time spent on each function

HW counters (e.g., cache misses, FLOPs)

A common Unix profiling tool is gprof.



Hardware Performance Counters

Information obtained with profilers is usually extremely helpful but may not be enough to completely optimize a code.

Hardware performance counters usually require instrumenting the code but give a much more detailed view.

There are several solutions:

PAPI
TAU
PDT


Part XVI

HPC Available Resources



HPC Available Resources: Portugal

MILIPEIA - http://www.lca.uc.pt

INGRID - http://www.gridcomputing.pt

RNCA - http://www.rnca.org.pt


HPC Available Resources: Europe

HPCEuropa - http://www.hpc-europa.org

PRACE - http://www.prace-project.eu