
Software & Services Group

Developer Products Division

Copyright © 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Parallel Programming Methodology

11/13/2014 1

http://software.intel.com/en-us/articles/optimization-notice/




Agenda


• Basic concepts and terminology

• Types of Parallelism in Intel® Processors

– Instruction Level Parallelism

– Data Level Parallelism

– Thread Level Parallelism

• Programming Methods Overview

• Summary





Processes and Threads

• Modern operating systems load programs as processes

• A process starts executing at its entry point as a thread

• Threads can create other threads within the process

• All threads within a process share code and data segments

[Figure: a process containing a code segment and a data segment, shared by the main() thread and the other threads it creates]






Concurrency vs. Parallelism

• Concurrency: Two or more threads are in progress at the same time

[Figure: timelines of Thread 1 and Thread 2]

• Parallelism: Two or more threads are executing at the same time

[Figure: timelines of Thread 1 and Thread 2]






Why Use Threads?

• Benefits

– Increased performance

– Better resource utilization

• Risks

– Increases complexity of application

– Difficult to debug (data races, deadlocks, etc.)






Threading – When, Why, How?

• When to thread?

– Independent tasks that can execute concurrently

• Why to thread?

– Improve turnaround or throughput

• How to thread?

– Functional decomposition or data decomposition






Turnaround

• Complete a single task in the smallest amount of time

• Example: Setting a dinner table

– One to put down plates

– One to fold and place napkins

– One to place utensils

– One to place glasses






Throughput

• Complete more tasks in a fixed amount of time

• Example: Setting up banquet tables

– Multiple waiters each do separate tables

– Specialized waiters for plates, glasses, utensils, etc.






Functional Decomposition

• Divide work according to functional differences

• Best used when the amount of work scales with the number of independent tasks

• Example: Building a house involves many different trades (bricklayer, carpenter, roofer, plumber, etc.)






Data Decomposition

• Divide work according to independent data

• Best used when the amount of work scales with the amount of data

• Example: Grading exams






Static vs. Dynamic Scheduling

• What is the best way to divide the work?

– Problem 1: 1000 exams, 4 graders, 1 answer key

• What if the exams are not the same?

– Problem 2: 1000 exams, 4 graders, 4 answer keys
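One reading of the two problems is that identical exams can be split into fixed blocks up front (static scheduling), while exams of differing length are better handed out as graders become free (dynamic scheduling). The sketch below expresses both with OpenMP's schedule clause; grade_exam and its per-question cost are hypothetical stand-ins, not part of the slides. Without OpenMP enabled (for example -fopenmp with GCC) the pragmas are ignored and the loops run serially with the same result.

```c
/* Hypothetical grading cost: an exam with more questions takes longer. */
static long grade_exam(int n_questions)
{
    long score = 0;
    for (int q = 0; q < n_questions; q++)
        score += q % 2;   /* stand-in for checking one answer */
    return score;
}

/* Static scheduling: iterations are divided into fixed blocks up front,
   which suits exams that all take about the same time to grade. */
long grade_static(const int *n_questions, int n_exams)
{
    long total = 0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < n_exams; i++)
        total += grade_exam(n_questions[i]);
    return total;
}

/* Dynamic scheduling: a thread takes the next chunk of 10 exams whenever it
   finishes its current chunk, which evens out exams of varying length. */
long grade_dynamic(const int *n_questions, int n_exams)
{
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 10) reduction(+:total)
    for (int i = 0; i < n_exams; i++)
        total += grade_exam(n_questions[i]);
    return total;
}
```

Both functions return the same total; only the assignment of exams to threads differs. The chunk size of 10 is an arbitrary choice for illustration.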






Parallel Performance

• There are many ways to use parallelism to improve turnaround or throughput

• Examples

– Automobile assembly line: each worker does an assigned function

– Searching for pieces of Skylab: divide up the area to be searched

– Postal service: post office branches, mail sorters, delivery






Race Conditions

• Parallel processes can “race” against each other for resources

• Race conditions occur when execution order is assumed but not guaranteed

• Example: Unsynchronized access to bank account

– One thread deposits $100 into the account

– Another thread withdraws $100 from the account

– Initial balance = $1000; final balance = ?






Race Conditions

One possible interleaving of the $100 withdrawal and the $100 deposit (initial balance = $1000):

Time   Withdrawal                 Deposit
T0     Load (balance = $1000)
T1     Subtract $100              Load (balance = $1000)
T2     Store (balance = $900)     Add $100
T3                                Store (balance = $1100)

The withdrawal's update is lost: the final balance is $1100 instead of the correct $1000.
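The interleaving in the table can be replayed step by step in a single thread, which makes the lost update deterministic and easy to inspect. This is only a sketch; the function and variable names are illustrative, and the dollar amounts in the comments assume a starting balance of 1000.

```c
/* Replays the T0..T3 interleaving from the table in one thread.
   withdraw_reg and deposit_reg model each thread's private copy. */
int replay_lost_update(int balance)
{
    int withdraw_reg, deposit_reg;

    withdraw_reg = balance;     /* T0: withdrawal loads $1000 */
    withdraw_reg -= 100;        /* T1: withdrawal subtracts $100 */
    deposit_reg = balance;      /* T1: deposit loads the stale $1000 */
    balance = withdraw_reg;     /* T2: withdrawal stores $900 */
    deposit_reg += 100;         /* T2: deposit adds $100 */
    balance = deposit_reg;      /* T3: deposit stores $1100, overwriting $900 */
    return balance;
}
```

Starting from 1000, the function returns 1100 although one deposit and one withdrawal of equal size should leave the balance unchanged.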






Critical Regions and Mutual Exclusion

• Critical Region

– A block of code that contains side effects, e.g.:

    – Updates global data

    – Performs I/O

– In the previous example, access to the bank account is a critical region

• Mutual Exclusion

– Program logic to enforce single-thread access to a critical region

– Enables correct programming structures for avoiding race conditions






Mutual Exclusion

• Lock: a synchronization object used to enforce mutual exclusion on a critical region

– While someone “holds” a lock, others must wait

– The owner of the lock may enter the critical region

– When done, the holder releases the lock and someone else can acquire it

• Example: Library book

– One patron checks out a book

– Other patrons must wait for return of the book
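As a minimal sketch of the lock protocol above, the bank account from the race-condition example can be guarded with a POSIX threads mutex. The names (balance_lock, apply_transactions, run_bank_demo) and the transaction count are assumptions made for illustration; the slides do not prescribe a particular API here.

```c
#include <pthread.h>
#include <stddef.h>

#define N_TRANSACTIONS 100000   /* illustrative count per thread */

static long balance;
static pthread_mutex_t balance_lock = PTHREAD_MUTEX_INITIALIZER;

/* Adds the amount pointed to by arg to the shared balance, repeatedly. */
static void *apply_transactions(void *arg)
{
    long amount = *(const long *)arg;
    for (int i = 0; i < N_TRANSACTIONS; i++) {
        pthread_mutex_lock(&balance_lock);     /* acquire: others must wait */
        balance += amount;                     /* critical region */
        pthread_mutex_unlock(&balance_lock);   /* release */
    }
    return NULL;
}

/* Runs one depositing and one withdrawing thread and returns the final
   balance, or -1 if a thread could not be created. */
long run_bank_demo(long initial_balance)
{
    pthread_t depositor, withdrawer;
    long deposit = 100, withdrawal = -100;

    balance = initial_balance;
    if (pthread_create(&depositor, NULL, apply_transactions, &deposit) != 0)
        return -1;
    if (pthread_create(&withdrawer, NULL, apply_transactions, &withdrawal) != 0) {
        pthread_join(depositor, NULL);
        return -1;
    }
    pthread_join(depositor, NULL);
    pthread_join(withdrawer, NULL);
    return balance;
}
```

Because every update happens while the lock is held, equal numbers of deposits and withdrawals leave the balance where it started. Removing the lock and unlock calls would reintroduce the race.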







Synchronization: Barriers

• Threads pause at barrier

• When all threads arrive, all are released

• While waiting, threads are idle

• Example: Race starting line
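A barrier is typically needed between two phases when the second phase reads what other threads wrote in the first. The sketch below uses OpenMP directives; the function name, array size, and the arithmetic are illustrative assumptions. Without OpenMP enabled the pragmas are ignored and the code runs serially with the same result.

```c
#define N 8   /* illustrative array size */

/* Phase 1 fills a[]; phase 2 reads a[] in reverse order, so a thread may
   need elements that a different thread wrote in phase 1. */
void two_phase(int a[N], int b[N])
{
    #pragma omp parallel
    {
        /* nowait drops the implicit barrier at the end of this loop */
        #pragma omp for nowait
        for (int i = 0; i < N; i++)
            a[i] = i * i;

        /* every thread pauses here until all threads have finished phase 1 */
        #pragma omp barrier

        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = a[N - 1 - i] + 1;
    }
}
```

The nowait clause is there only to make the explicit barrier visible; a plain "omp for" already ends with an implicit barrier.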






Deadlock

• Parallel processes wait for some event or condition that cannot happen

• Example: Two books are needed to complete an assignment. One student has checked out the first book, but another student has checked out the second book. Each waits for the book the other holds.

[Figure: each student wants the book that is checked out by the other]
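One common way to rule out this kind of deadlock is to make every thread acquire locks in the same global order. The sketch below applies that idea to the two-book example with POSIX mutexes; the Book type, its id field, and both function names are hypothetical, and the code assumes the two books are distinct and have different ids.

```c
#include <pthread.h>

/* A library book guarded by a lock; id gives every book a fixed rank. */
typedef struct {
    int id;
    pthread_mutex_t lock;
} Book;

/* Checks out both books, always locking the lower id first, so two students
   asking for the same pair cannot each hold one book and wait forever. */
void check_out_both(Book *x, Book *y)
{
    Book *first  = (x->id < y->id) ? x : y;
    Book *second = (x->id < y->id) ? y : x;
    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
}

/* Returns both books; unlock order does not matter for correctness. */
void return_both(Book *x, Book *y)
{
    pthread_mutex_unlock(&x->lock);
    pthread_mutex_unlock(&y->lock);
}
```

With this ordering, whichever student locks the lower-ranked book first is guaranteed to get the second one as well, so the circular wait from the example cannot form.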






Amdahl’s Law

• Parallel speedup is limited by the amount of serial code in an application.

[Figure: serial vs. parallel execution of Phase 1 and Phase 2]
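The slide states the law in words only. In its usual textbook form, assuming p is the fraction of the serial run time that can be parallelized and N is the number of processors (both symbols are introduced here, not taken from the slides), the speedup is:

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

The limit shows why the serial fraction 1 - p caps the achievable speedup no matter how many processors are added.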






Parallel Speedup

• Measure of how much faster the computation executes versus the best serial time

– Serial time divided by parallel time

• Example: Painting a picket fence

– 30 minutes of preparation (serial)

– One minute to paint a single picket

– 30 minutes of cleanup (serial)

Thus, 300 pickets take 360 minutes (serial time)






Computing Speedup

• What if the fence owner uses a spray gun to paint 300 pickets in one hour?

– Better serial algorithm

– If no spray guns are available for multiple workers, what is the maximum parallel speedup?

Number of painters   Time (minutes)          Speedup
1                    30 + 300 + 30 = 360     1.0X
2                    30 + 150 + 30 = 210     1.7X
10                   30 + 30 + 30 = 90       4.0X
100                  30 + 3 + 30 = 63        5.7X
Infinite             30 + 0 + 30 = 60        6.0X

Illustrates Amdahl’s Law: potential speedup is restricted by serial code






Efficiency

• Measure of how effectively computation resources are kept busy

– Speedup divided by number of processors

– Expressed as average percentage of non-idle time

Number of painters   Time (minutes)          Speedup   Efficiency
1                    360                     1.0X      100%
2                    30 + 150 + 30 = 210     1.7X      85%
10                   30 + 30 + 30 = 90       4.0X      40%
100                  30 + 3 + 30 = 63        5.7X      5.7%
Infinite             30 + 0 + 30 = 60        6.0X      Very low
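The speedup and efficiency columns follow directly from the two definitions. A small sketch that recomputes them for the fence-painting model (function names are illustrative):

```c
/* Fence-painting model from the slides: 30 minutes of preparation and
   30 minutes of cleanup are serial; 300 one-minute pickets are shared
   equally among the painters. Returns minutes. */
double painting_time(int painters)
{
    return 30.0 + 300.0 / painters + 30.0;
}

/* Speedup = serial time / parallel time. */
double speedup(int painters)
{
    return painting_time(1) / painting_time(painters);
}

/* Efficiency = speedup / number of processors (here, painters). */
double efficiency(int painters)
{
    return speedup(painters) / painters;
}
```

For two painters the unrounded efficiency is about 85.7%; the table's 85% appears to come from dividing the rounded 1.7X by 2.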






Granularity

• Loosely defined as the ratio of computation to synchronization

• Be sure there is enough work to merit parallel computation

• Example: Two farmers divide a field. How many more farmers can be added?






Load Balance

• Most effective distribution is to have equal amounts of work per processor; otherwise, some processors sit idle

• Example: Busing banquet tables

– Better to assign the same number of tables to each person






Parallelism in Intel® Processor Based Platforms

• ILP - Instruction Level Parallelism

– Pipelined execution

– Super-scalar execution

• DLP - Data Level Parallelism

– SSE vector processing

– SIMD: Single Instruction, Multiple Data

• TLP - Thread Level Parallelism

– Hardware support for Hyper-Threading

– Multi-core architecture

– Cache-coherent multiple sockets

• CLP - Cluster Level Parallelism

– Multiple platforms connected via interconnection network

– No hardware-supported cache coherence





DLP – Data Level Parallelism in Intel® Processors

SIMD: Single Instruction, Multiple Data

• Scalar processing

– traditional mode

– one operation produces one result

• SIMD processing

– with SSE / SSE2

– one operation produces multiple results

[Figure: scalar: X + Y = X + Y. SIMD: (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0)]
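The SIMD case in the figure can be sketched with SSE intrinsics, where one _mm_add_ps adds four packed floats at once. The function name add_arrays and the assumption that n is a multiple of 4 are illustrative; the preprocessor guard falls back to a scalar loop on targets without SSE.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics */
#endif

/* Stores x[i] + y[i] in out[i]; n is assumed to be a multiple of 4. */
void add_arrays(const float *x, const float *y, float *out, size_t n)
{
#if defined(__SSE__)
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);              /* x3 x2 x1 x0 */
        __m128 vy = _mm_loadu_ps(y + i);              /* y3 y2 y1 y0 */
        _mm_storeu_ps(out + i, _mm_add_ps(vx, vy));   /* four sums at once */
    }
#else
    /* Scalar fallback for targets without SSE: one sum per operation. */
    for (size_t i = 0; i < n; i++)
        out[i] = x[i] + y[i];
#endif
}
```

The unaligned load and store variants are used so the sketch works for any float array; aligned variants exist but require 16-byte aligned data.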





TLP – Thread Level Parallelism

Many Ways to Create Multiple Threads

• VMM (virtual container): OS, OS, OS

• Operating system: App, App, App

• Application: Thread, Thread, Thread

Threading individual applications is the only way to achieve sufficient TLP in general!





Methods Overview

• Parallel programming methods

– Thread libraries

    – Win32 API

    – POSIX threads

– Compiler directives

    – OpenMP*

– Message passing

    – MPI

– Threaded runtime libraries

    – Intel® Math Kernel Library (MKL), Integrated Performance Primitives (IPP) and Threading Building Blocks (TBB)

• Choosing which API is right for you





Parallel APIs: OpenMP*

[Figure: collage of OpenMP directives, library calls, and environment settings]

omp_set_lock(lck)
#pragma omp parallel for private(A, B)
#pragma omp critical
C$OMP parallel do shared(a, b, c)
C$OMP PARALLEL REDUCTION (+: A, B)
call OMP_INIT_LOCK (ilok)
call omp_test_lock(jlok)
setenv OMP_SCHEDULE "dynamic"
CALL OMP_SET_NUM_THREADS(10)
C$OMP DO lastprivate(XX)
C$OMP ORDERED
C$OMP SINGLE PRIVATE(X)
C$OMP SECTIONS
C$OMP MASTER
C$OMP ATOMIC
C$OMP FLUSH
C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
C$OMP THREADPRIVATE(/ABC/)
C$OMP PARALLEL COPYIN(/blk/)
Nthrds = OMP_GET_NUM_PROCS()
!$OMP BARRIER

OpenMP: An API for Writing Multithreaded Applications

• A set of compiler directives and library routines for parallel application programmers

• Makes it easy to create multithreaded (MT) programs in Fortran, C and C++

• Standardizes the last 17 years of SMP practice
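As a minimal sketch of the directive style, the loop below approximates pi by numerical integration and is parallelized with a single pragma. The function name compute_pi and the midpoint-rule formulation are choices made for this illustration. Without OpenMP enabled the pragma is ignored and the loop runs serially.

```c
/* Approximates pi as the integral of 4 / (1 + x^2) over [0, 1],
   using the midpoint rule with n_steps intervals. */
double compute_pi(long n_steps)
{
    double step = 1.0 / (double)n_steps;
    double sum = 0.0;

    /* Each thread accumulates a private partial sum; the reduction
       clause adds the partial sums together at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

The reduction clause is what keeps the shared accumulator free of a data race; without it, concurrent updates to sum would need a critical region or an atomic update.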










Parallel APIs: MPI: The Message Passing Interface

[Figure: collage of MPI routines and constants]

MPI_Bsend_init
MPI_Pack
MPI_Sendrecv_replace
MPI_Recv_init
MPI_Allgatherv
MPI_Unpack
MPI_Sendrecv
MPI_Bcast
MPI_Ssend
MPI_Startall
MPI_Test_cancelled
MPI_Type_free
MPI_Type_contiguous
MPI_Barrier
MPI_Start
MPI_COMM_WORLD
MPI_Recv
MPI_Send
MPI_Waitall
MPI_Reduce
MPI_Group_compare
MPI_Scan
MPI_Group_size
MPI_Errhandler_create

MPI: An API for Writing Clustered Applications

• A library of routines to coordinate the execution of multiple processes

• Provides point-to-point and collective communication in Fortran, C and C++

• Unifies the last 20 years of cluster computing and MPP practice





Programming with MPI

• You need to set up the environment … define the context for a group of processes

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_id);      /* give each process an ID ranging from 0 to (numprocs-1) */
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);   /* number of processes in the group */

• MPI uses multiple processes … nothing is shared by default; data is exchanged only by passing messages.

• Many MPI programs only include simple collective communication … for example, a reduction:

MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

– &sum: take a local value

– &pi: combine into a single value

– MPI_SUM: using this op (e.g. SUM)

– 0: and send the answer to this process ID





How do most people use MPI: The Single Program Multiple Data Pattern

SPMD pattern:

• Replicate the program.

• Add glue code (based on ID)

• Break up the data

[Figure: a sequential program working on a data set, next to a parallel program working on a decomposed data set, with coordination by passing messages]
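The "glue code (based on ID)" step often amounts to each process computing which slice of the data it owns from its own ID and the process count. The helper below is a sketch of one common block decomposition; block_range is a hypothetical name, and id and numprocs stand for the values the deck obtains from MPI_Comm_rank and MPI_Comm_size. It contains no MPI calls itself.

```c
/* Computes the half-open range [*lo, *hi) of the n data items owned by
   process id out of numprocs. The first (n % numprocs) processes get one
   extra item so the sizes differ by at most one. */
void block_range(int id, int numprocs, int n, int *lo, int *hi)
{
    int base  = n / numprocs;   /* items every process gets */
    int extra = n % numprocs;   /* leftover items to spread out */

    *lo = id * base + (id < extra ? id : extra);
    *hi = *lo + base + (id < extra ? 1 : 0);
}
```

Each replicated copy of the program then loops only over its own range and exchanges results by message passing, for example with the reduction shown earlier.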





How to mix MPI and OpenMP* in one program?

SPMD pattern:

• Replicate the program.

• Add glue code (based on ID)

• Break up the data

[Figure: a sequential program working on a data set]

• Create the MPI program with its data decomposition.

• Use OpenMP inside each MPI process.





Summary

• At least 4 levels of parallelism

• Intel® Developer Products support developers in extracting the best possible performance from each level

• Currently there are multiple methods to introduce threading and parallelism

• To facilitate parallel programming in the future, models are needed which automatically assign compute load to all levels of parallelism





Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804






Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.*Other names and brands may be claimed as the property of others.

Copyright © 2012. Intel Corporation.

http://intel.com/software/products



