Page 1

An Introduction to Parallel Programming

Marco Ferretti, Mirto Musci

Dipartimento di Informatica e Sistemistica

University of Pavia

Processor Architectures, Fall 2011

Marco Ferretti, Mirto Musci An Introduction to Parallel Programming

Page 2

Definition

What is parallel programming?

Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem.

Page 3

Outline - Serial Computing

Traditionally, software has been written for serial computation:

To be run on a single CPU;

A problem is broken into a discrete series of instructions.

Instructions are executed one after another.

Only one instruction may execute at any moment.

Page 4

Outline - Parallel Computing

Simultaneous use of multiple compute resources to solve a computational problem:

To be run using multiple CPUs

A problem is broken into discrete parts, solved concurrently

Each part is broken down to a series of instructions

Instructions from each part execute simultaneously on different CPUs

Page 5

The Real World is Massively Parallel

Parallel computing attempts to emulate what has always been the state of affairs in the natural world: many complex, interrelated events happening at the same time, yet within a sequence.

Page 6

Why Parallel Computing?

Save time and/or money

Solve larger problems

Provide concurrency (e.g. the Access Grid)

Use of non-local resources (e.g. SETI@home)

Limits to serial computing, physical and practical:

Transmission speeds

Miniaturization

Economic limitations

Page 7

Application I

Traditional Applications

High performance computing:

Atmosphere, Earth, Environment

Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics

Bioscience, Biotechnology, Genetics

Chemistry, Molecular Sciences

Geology, Seismology

Mechanical Engineering modeling

Circuit Design, Computer Science, Mathematics

Page 8

Application II

Emergent Applications

Nowadays, industry is THE driving force:

Databases, data mining

Oil exploration

Web search engines, web based business services

Medical imaging and diagnosis

Pharmaceutical design

Financial and economic modeling

Advanced graphics and virtual reality

Collaborative work environments

Page 9

Flynn's Classical Taxonomy

There are different ways to classify parallel computers.

Flynn's Taxonomy (1966) distinguishes multi-processor architectures by instruction and data streams:

SISD - Single Instruction, Single Data

SIMD - Single Instruction, Multiple Data

MISD - Multiple Instruction, Single Data

MIMD - Multiple Instruction, Multiple Data

Page 10

Flynn's Classical Taxonomy - SISD

Serial (non-parallel) computer

Single Instruction: Only one instruction stream is acted on by the CPU during any clock cycle

Single Data: Only one data stream is used as input

Deterministic execution

Page 11

Flynn's Classical Taxonomy - SIMD

Single Instruction: All CPUs execute the same instruction at any given clock cycle

Multiple Data: Each CPU can operate on a different data element

Problems characterized by high regularity, such as graphics processing

Synchronous (lockstep) and deterministic execution

Vector processors and GPU Computing

Page 12

Flynn's Classical Taxonomy - MISD

Multiple Instruction: Each CPU operates on the data independently

Single Data: A single data stream feeds multiple CPUs

Very few practical examples of this class of parallel computer exist.

Example: multiple cryptography algorithms attempting to crack a single coded message.

Page 13

Flynn's Classical Taxonomy - MIMD

Multiple Instruction: Every processor may be executing a different instruction stream

Multiple Data: Every processor may be working with a different data stream

Synchronous or asynchronous,deterministic or non-deterministic

Examples: most supercomputers,networked clusters and "grids",multi-core PCs

Page 14

Outline

1 Memory Architecture: Shared Memory; Distributed Memory; Hybrid Architecture

2 Parallel programming models: Shared Memory Model; Message Passing Model; GPGPU; Hybrid Model

3 Designing Parallel Programs: General Characteristics; Partitioning; Limits and Costs


Page 16

Shared Memory Architecture

Shared memory: all processors access all memory as a global address space.

CPUs operate independently but share memory resources.

Changes in a memory location effected by one processor are visible to all other processors.

Two main classes based upon memory access times: UMA andNUMA.

Page 17

Uniform Memory Access

Uniform Memory Access (UMA):

Identical processors

Equal access and access times to memory

Cache coherent: if one processor updates a location in shared memory, all the other processors know about the update.

Page 18

Non-Uniform Memory Access

Non-Uniform Memory Access (NUMA):

Often made by physically linking two or more SMPs

One SMP can directly access memory of another SMP

Not all processors have equal access time to all memories

Memory access across link is slower

Page 19

Pros and Cons

Advantages

Global address space is easy to program

Data sharing is fast and uniform due to the proximity of memory to CPUs

Disadvantages

Lack of scalability between memory and CPUs

Programmer responsibility for synchronization


Page 21

General Characteristics

A communication network is required to connect inter-processor memory.

Processors have their own local memory.

No concept of global address space

CPUs operate independently.

Changes to local memory have no effect on the memory of other processors.

Cache coherency does not apply.

Data communication and synchronization are programmer'sresponsibility

The connection used for data transfer varies (e.g. Ethernet)

Page 22

Simple Diagram


Page 24

Hybrid Distributed-Shared Memory

The largest and fastest computers in the world today employ hybrid architectures.

The shared memory component can be a cache coherent SMP machine and/or graphics processing units (GPU).

The distributed memory component is the networking of multiple SMP/GPU machines, which know only about their own memory.

Network communications are required to move data from one SMP/GPU to another.

Page 25

Simple Diagram

Page 26

Overview

Several parallel programming models in common use:

Shared Memory / Threads

Distributed Memory / Message Passing

Hybrid

GPGPU

... and many more

Parallel programming models exist as an abstraction above hardware and memory architectures.

No "best" model, although there are better implementations of some models over others.

Page 27

Overview II

Models are NOT specific to a particular machine or architecture.

Shared memory model on a distributed memory machine: KSR-1

Memory is physically distributed, but appears global

Virtual Shared Memory

Distributed model on a shared memory machine: SGI Origin 2000

CC-NUMA architecture

Tasks see a global address space, but use message passing


Page 29

Introduction

Multiple processes or lightweight threads with a unified view of memory

Need for careful synchronization

Programmer is responsible for determining all parallelism.

Application Domain

Number crunching on single, heavily parallel machines

Desktop applications

In HPC or industrial applications it is seldom used by itself (clusters dominate)

...but Intel is developing Cluster OpenMP

Page 30

Threading Model

Similar to a single program that includes a number of subroutines:

a.out performs some serial work, then creates a number of tasks (threads) that can be scheduled and run concurrently.

Each thread has local data, but also shares the entire resources of a.out

A thread's work can be seen as a subroutine

Threads communicate through global memory

Threads come and go, but a.out remains

Page 31

Implementations

Explicit approaches

Linux: Pthreads library

Library based; requires parallel coding

C language only

Very explicit parallelism

Microsoft Windows Threading

Frameworks

Intel TBB

OpenMP

Distributed shared memory OS (research topic)

Page 32

OpenMP

It is not a library: a series of compiler directives or pragmas

#pragma omp parallel ...

Requires explicit compiler support

Features:

Automatic data layout and decomposition

Incremental parallelism: one portion of the program at a time

Unified code for both serial and parallel applications

Coarse-grained and fine-grained parallelism possible

Scalability is limited by the memory architecture


Page 34

Distributed Memory / Message Passing Model

Multiple tasks on an arbitrary number of machines, each with local memory.

Tasks exchange data through communications, by sending and receiving messages.

Data transfer usually requires cooperative operations

Page 35

MPI: introduction

Language-independent API specification: allows processes to communicate with one another by sending and receiving messages

De facto standard for HPC on clusters and supercomputers

MPI was created in 1992; the first standard appeared in 1994. It superseded PVM.

Various implementations:

Open MPI: open source, widespread; used by faculties, academic institutions and several Top500 supercomputers

Commercial implementations from HP, Intel, and Microsoft (derived from the early MPICH)

Custom implementations, with parts in assembler or even in hardware (research topic)

Page 36

MPI: features

Very portable: language and architecture independent

MPI hides the communication complexities between distributed processes:

virtual topology

synchronization

Single process maps to single processor

An external agent (mpirun, mpiexec) is needed to coordinate and manage task assignment and program termination

The original (serial) application must be rewritten

Page 37

MPI: functions

Library Functions

Point-to-point communication

synchronous

asynchronous

buffered

ready forms

Broadcast and multicast communication

Fast and safe combination of partial results: gather and reduce

Node synchronization: barrier


Page 39

GPGPU: Introduction

Using a GPU to perform computation in applications traditionally handled by the CPU

Made possible by modifications to the rendering pipeline:

addition of programmable stages

higher precision arithmetic

Allows software developers to use stream processing on non-graphics data.

Especially suitable for applications that exhibit:

Compute Intensity

Data Parallelism

Data Locality

Page 40

Standards

CUDA and Stream

NVIDIA and AMD standards, respectively

Frameworks for explicit GPU programming

Language extensions, libraries and runtime environments

Common features:

Heavily threaded (tens of thousands of threads)

Multi-level parallelism: grids, blocks, threads, warps...

Explicit memory copies to/from the CPU

Linear and bidimensional memory access (texture memory)

Page 41

Application domains

Where GPGPU has been used successfully:

Ray tracing and global illumination (not normally supported by the rendering pipeline)

Physics-based simulation and physics engines

Weather forecasting

Climate research

Molecular modeling, ...

Computational finance, Medical imaging, Digital signal processing, Database operations, Cryptography and cryptanalysis, Fast Fourier transform, ...


Page 43

Hybrid Model

Combines the previously described programming models

Well suited for cluster of SMP machines

Example: Using MPI with GPU programming.

GPUs perform intensive kernels using local, on-node data

MPI handles communications between nodes

Page 44

Implementations

Prevalence of ad hoc solutions

New Standards

Compiler Directives

OpenHMPP

OpenMP 3.x

Programming Libraries: OpenCL

Page 45

OpenCL

Host-Device abstraction: handles any sort of parallel computing domain

CPU to CPU

CPU to GPU

CPU to DSP or any embedded device

Needs explicit hardware support

NVIDIA, ATI and AMD support OpenCL

Unfortunately, embedded device support is low

Similar to CUDA, but more generic and somewhat less programmer-friendly

Page 46

Summary

Parallel technologies are pervasive

Lots of possible approaches

Continued development and innovation

Future developments?

Hybrid technologies are the key, but they need hardware support and programmer assistance

GPU computing will become common in desktop and embedded systems



Automatic vs Manual Parallelization

Automatic

The compiler analyzes the code and identifies parallelism
It attempts to determine whether parallelization actually improves performance
Loops are the most frequent target

Manual - Understand the problem

Parallelizable Problem

Calculate the potential energy for each of several conformations of a molecule; then find the minimum-energy conformation.

A Non-Parallelizable Problem

The Fibonacci series: every term depends on previously computed terms, so the calculations cannot be carried out independently.


Keywords

Parallel Design Keywords

Synchronization

Communications

Data Dependence

Load Balancing

Granularity

Partitioning


Synchronization

De�nition

The coordination of parallel tasks in real time

Barrier

Each task performs its work until it stops at the barrier
When the last task arrives, all tasks are synchronized
What happens from here varies (serial work or task release)

Lock / semaphore

Used to protect access to global data or a section of code
Only one task at a time may "own" the lock
Other tasks can attempt to acquire the lock, but must wait until it is released


Communications

De�nition

Data exchange between parallel tasks

Who Needs Communications?

Embarrassingly parallel problems (e.g. operations on independent pixels) need little or no communication
Most problems are not that simple (e.g. 3D heat diffusion)

Factors to Consider:

Cost of communications
Latency vs. bandwidth
Visibility of communications
Synchronous vs. asynchronous communications
Scope of communications (point-to-point, collective)


Communications Complexity


Data Dependencies

De�nition

A dependence exists between program statements when the order of statement execution affects the results of the program

Dependencies result from multiple uses of the same storage location by different tasks

They are among the primary inhibitors to parallelism

Loop-carried dependencies are particularly important, since loops are the most common target of parallelization

How to Handle Data Dependencies:

Distributed memory: communicate the required data at synchronization points
Shared memory: synchronize read/write operations between tasks


Loop-Carried Dependence

Loop carried data dependence

for (j = 1; j < N; j++) {
    vect[j] = vect[j-1] * 2;
}

vect[j-1] must be computed before vect[j]: a loop-carried data dependence

If task 2 has vect[j] and task 1 has vect[j-1], computing the correct value requires:

Distributed memory: task 2 must obtain the value of vect[j-1] from task 1 after task 1 finishes its computation
Shared memory: task 2 must read vect[j-1] only after task 1 updates it


Loop-Independent Dependence

Loop independent data dependence

task 1 task 2

------ ------

X = 2 X = 4

. .

. .

Y = X**2 Y = X**3

As before, parallelism is inhibited.

The value of Y depends on:

Distributed memory: if or when the value of X is communicated between the tasks
Shared memory: which task last stores the value of X


Load Balancing

De�nition

The practice of distributing work among tasks so that all tasks are kept busy all of the time: the minimization of task idle time.

Important for performance reasons
Example: if all tasks are subject to a barrier synchronization, the slowest task determines the overall performance


Load Balancing II

How to Achieve Load Balance:

Equally partition the work each task receives

For array/matrix operations: evenly distribute the data set
For loop iterations: evenly distribute the iterations
For a heterogeneous mix of machines: use an analysis tool to detect imbalances

Use dynamic work assignment

Certain problems result in load imbalances anyway:

Sparse arrays
Adaptive grid methods

If the workload is variable or unpredictable, use a scheduler: as each task finishes, it queues to get a new piece of work
Design an algorithm which detects and handles load imbalances as they occur


Granularity

Granularity: the ratio of computation to communication

Fine-grained: low computation, high communication

High communication overhead
Facilitates load balancing

Coarse-grained: high computation, low communication

Potential for performance gains
Harder to load balance efficiently


Outline

1 Memory Architecture: Shared Memory, Distributed Memory, Hybrid Architecture

2 Parallel programming models: Shared Memory Model, Message Passing Model, GPGPU, Hybrid Model

3 Designing Parallel Programs: General Characteristics, Partitioning, Limits and Costs


Partitioning

One of the first steps in design is to break the problem into discrete "chunks" of work that can be distributed to multiple tasks.

Two basic ways:

Domain decomposition
Functional decomposition


Domain Decomposition

The data associated with the problem is decomposed. Each parallel task then works on a portion of the data.


Functional Decomposition

The problem is decomposed according to the computation to be performed rather than to the data. Each task then performs a portion of the overall work.


Signal Processing Example

An audio signal data set is passed through four distinct computational filters. Each filter is a separate process. The first segment of data must pass through the first filter before progressing to the second. As soon as it does, the second segment of data passes through the first filter. By the time the fourth segment of data is in the first filter, all four tasks are busy.


Outline

1 Memory Architecture: Shared Memory, Distributed Memory, Hybrid Architecture

2 Parallel programming models: Shared Memory Model, Message Passing Model, GPGPU, Hybrid Model

3 Designing Parallel Programs: General Characteristics, Partitioning, Limits and Costs


Complexity

Parallel applications can be much more complex than their serial counterparts:

Multiple instruction streams
Data flowing between tasks

The costs of complexity are measured in programmer time in every aspect of the development cycle:

Design
Coding
Debugging
Maintenance

Good software development practices are essential when working with parallel applications


Portability

Thanks to standardization in several APIs, portability issues are not as serious as in years past

However, all of the usual serial portability issues apply:

Implementations
Operating systems
Hardware architectures


Resource Requirements

A decrease in wall-clock time often means an increase in total CPU time

The amount of memory required can be greater:

Data replication
Overheads

For short parallel programs, performance can actually decrease, since the overheads outweigh the gains


Amdahl's Law I

Potential program speedup is defined by the fraction of code (P) that can be parallelized:

    speedup = 1 / (1 − P)

If none of the code can be parallelized, P = 0 and the speedup is 1 (no speedup). If all of the code is parallelized, P = 1 and the speedup is infinite (in theory).


Amdahl's Law II

Introducing the number of processors N (with S = 1 − P the serial fraction):

    speedup = 1 / (P/N + S)

There are limits to the scalability of parallelism.


The Future:

During the past 20+ years, the trends have indicated that parallelism is the future of computing.
In this same period, supercomputer performance has increased by a factor of 1000.
The race is already on for Exascale Computing!


Appendix For Further Reading

For Further Reading I

A. Grama, A. Gupta. Introduction to Parallel Computing. Addison-Wesley, 2003.

Blaise Barney. Introduction to Parallel Computing, 2011. https://computing.llnl.gov/tutorials/
