Software & Services Group
Developer Products Division Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Parallel Programming Methodology
11/13/2014
Agenda
• Basic concepts and terminology
• Types of Parallelism in Intel® Processors
– Instruction Level Parallelism
– Data Level Parallelism
– Thread Level Parallelism
• Programming Methods Overview
• Summary
Processes and Threads
• Modern operating systems load programs as processes
• A process starts executing at its entry point as a thread
• Threads can create other threads within the process
• All threads within a process share code and data segments
[Figure: a process with a code segment and a data segment shared by its threads: the main() thread and additional threads]
Concurrency vs. Parallelism
• Concurrency: Two or more threads are in progress at the same time
• Parallelism: Two or more threads are executing at the same time
[Figure: timelines of Thread 1 and Thread 2 for each case]
Why Use Threads?
• Benefits
– Increased performance
– Better resource utilization
• Risks
– Increases complexity of application
– Difficult to debug (data races, deadlocks, etc.)
Threading – When, Why, How?
• When to thread?
– Independent tasks that can execute concurrently
• Why thread?
– To improve turnaround or throughput
• How to thread?
– Functional decomposition or data decomposition
Turnaround
• Complete a single task in the smallest amount of time
• Example: Setting a dinner table
– One to put down plates
– One to fold and place napkins
– One to place utensils
– One to place glasses
Throughput
• Complete more tasks in a fixed amount of time
• Example: Setting up banquet tables
– Multiple waiters each do separate tables
– Specialized waiters for plates, glasses, utensils, etc.
Functional Decomposition
• Divide work according to functional differences
• Best used when the amount of work scales with the number of independent tasks
• Example: Building a house involves many different trades (e.g., bricklayer, carpenter, roofer, plumber)
Data Decomposition
• Divide work according to independent data
• Best used when the amount of work scales with the amount of data
• Example: Grading exams
Static vs. Dynamic Scheduling
• What is the best way to divide the work?
– Problem 1: 1000 exams, 4 graders, 1 answer key
• What if the exams are not the same?
– Problem 2: 1000 exams, 4 graders, 4 answer keys
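One way to picture the two problems is with OpenMP loop schedules. The sketch below is only an illustration: grade_exam is a hypothetical stand-in for the real grading work, and the batch size of 10 is an arbitrary assumption. A static schedule hands each grader a fixed share up front, which suits Problem 1; a dynamic schedule lets a grader who finishes early take the next batch, which suits Problem 2.

```c
#include <assert.h>

/* Hypothetical stand-in for grading one exam; returns a score. */
static int grade_exam(int exam_id)
{
    return exam_id % 101;   /* placeholder work */
}

/* Problem 1: every exam costs about the same, so equal fixed shares
   (static scheduling) keep all graders busy. */
long grade_all_static(int num_exams)
{
    long total = 0;
    assert(num_exams >= 0);
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < num_exams; i++)
        total += grade_exam(i);
    return total;
}

/* Problem 2: exams differ in cost, so graders take batches of 10 exams
   on demand (dynamic scheduling) instead of a fixed share. */
long grade_all_dynamic(int num_exams)
{
    long total = 0;
    assert(num_exams >= 0);
    #pragma omp parallel for schedule(dynamic, 10) reduction(+:total)
    for (int i = 0; i < num_exams; i++)
        total += grade_exam(i);
    return total;
}
```

Both functions return the same total; only the assignment of exams to threads differs. Dynamic scheduling buys better balance at the price of more scheduling overhead per batch.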
Parallel Performance
• There are many ways to use parallelism to improve turnaround or throughput
• Examples
– Automobile assembly line: each worker does an assigned function
– Searching for pieces of Skylab: divide up the area to be searched
– Postal service: post office branches, mail sorters, delivery
Race Conditions
• Parallel processes can “race” against each other for resources
• Race conditions occur when execution order is assumed but not guaranteed
• Example: Unsynchronized access to bank account
– One party deposits $100 into the account
– Another withdraws $100 from the account
– Initial balance = $1000; final balance = ?
Race Conditions
• Initial balance = $1000; a $100 withdrawal and a $100 deposit run at the same time

Time  Withdrawal               Deposit
T0    Load (balance = $1000)
T1    Subtract $100            Load (balance = $1000)
T2    Store (balance = $900)   Add $100
T3                             Store (balance = $1100)

• Final balance = $1100, not $1000: the deposit's store at T3 overwrites the withdrawal's store at T2
Critical Regions and Mutual Exclusion
• Critical Region
– A block of code that contains side effects, e.g.:
  – Updates global data
  – Performs I/O
– In the previous example, access to the bank account is a critical region
• Mutual Exclusion
– Program logic that enforces single-thread access to a critical region
– Enables correct programming structures for avoiding race conditions
Mutual Exclusion
• Lock: a synchronization object used to enforce mutual exclusion on a critical region
– While someone “holds” a lock, others must wait
– The owner of the lock may enter the critical region
– When done, the holder releases the lock and someone else can acquire it
• Example: Library book
– One patron checks out a book
– Other patrons must wait for return of book
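The bank account example can be protected with a lock. The following is a minimal sketch using POSIX threads, assuming a platform that provides pthreads; the names balance, balance_lock, apply_transaction and run_bank_example are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static long balance;                                       /* shared account */
static pthread_mutex_t balance_lock = PTHREAD_MUTEX_INITIALIZER;

/* Applies a signed amount to the shared balance inside the critical region. */
static void *apply_transaction(void *arg)
{
    long amount = *(const long *)arg;
    pthread_mutex_lock(&balance_lock);    /* wait while someone else holds the lock */
    balance = balance + amount;           /* load, modify, store as one unit */
    pthread_mutex_unlock(&balance_lock);  /* release so another thread can enter */
    return NULL;
}

/* Runs a $100 deposit and a $100 withdrawal concurrently on a $1000 balance.
   Returns the final balance, or -1 if a thread could not be started. */
long run_bank_example(void)
{
    long deposit = 100, withdrawal = -100;
    pthread_t depositor, withdrawer;

    balance = 1000;
    if (pthread_create(&depositor, NULL, apply_transaction, &deposit) != 0)
        return -1;
    if (pthread_create(&withdrawer, NULL, apply_transaction, &withdrawal) != 0) {
        pthread_join(depositor, NULL);
        return -1;
    }
    pthread_join(depositor, NULL);
    pthread_join(withdrawer, NULL);
    assert(balance == 1000);   /* neither update was lost */
    return balance;
}
```

Because each load-modify-store sequence runs while the lock is held, the interleaving that produced $1100 cannot occur, whichever thread runs first.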
Synchronization: Barriers
• Threads pause at barrier
• When all threads arrive, all are released
• While waiting, threads are idle
• Example: Race starting line
Deadlock
• Parallel processes wait for some event or condition that cannot happen
• Example: Two books are needed to complete an assignment. One student has checked out the first book, and another student has checked out the second. Each waits for the book the other holds, so neither can finish.
[Figure: each student wants the book that is checked out by the other]
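A common way to avoid this kind of deadlock is to make every thread acquire its locks in the same global order. The sketch below assumes POSIX threads and orders the two locks by address; this is one technique among several and is not taken from the slides. The book and student types and all function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    pthread_mutex_t lock;   /* held while the book is checked out */
    int times_used;
} book;

typedef struct {
    book *wanted_first;
    book *wanted_second;
} student;

/* Checks out both books in one global order (lower address first),
   regardless of which book the student asked for first. */
static void check_out_both(book *a, book *b)
{
    assert(a != b);
    if ((uintptr_t)a > (uintptr_t)b) { book *tmp = a; a = b; b = tmp; }
    pthread_mutex_lock(&a->lock);
    pthread_mutex_lock(&b->lock);
}

static void return_both(book *a, book *b)
{
    pthread_mutex_unlock(&a->lock);
    pthread_mutex_unlock(&b->lock);
}

static void *do_assignments(void *arg)
{
    student *s = arg;
    for (int i = 0; i < 1000; i++) {
        check_out_both(s->wanted_first, s->wanted_second);
        s->wanted_first->times_used++;    /* safe: both locks are held */
        s->wanted_second->times_used++;
        return_both(s->wanted_first, s->wanted_second);
    }
    return NULL;
}

/* Two students want the same two books in opposite orders.
   Returns the total uses recorded, or -1 if a thread could not be started. */
int run_two_students(void)
{
    book first, second;
    student s1 = { &first, &second }, s2 = { &second, &first };
    pthread_t t1, t2;
    int total = -1;

    first.times_used = second.times_used = 0;
    pthread_mutex_init(&first.lock, NULL);
    pthread_mutex_init(&second.lock, NULL);
    if (pthread_create(&t1, NULL, do_assignments, &s1) == 0) {
        int started_second = (pthread_create(&t2, NULL, do_assignments, &s2) == 0);
        pthread_join(t1, NULL);
        if (started_second) {
            pthread_join(t2, NULL);
            total = first.times_used + second.times_used;
        }
    }
    pthread_mutex_destroy(&first.lock);
    pthread_mutex_destroy(&second.lock);
    return total;
}
```

If each student instead locked the books in the order requested, one could hold the first book while the other holds the second, which is exactly the situation described above.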
Amdahl’s Law
• Parallel speedup is limited by the amount of serial code in an application.
[Figure: serial and parallel execution bars, with phases labeled Phase 1 and Phase 2]
Parallel Speedup
• Measure of how much faster the computation executes versus the best serial time
–Serial time divided by parallel time
• Example: Painting a picket fence
– 30 minutes of preparation (serial)
– One minute to paint a single picket
– 30 minutes of cleanup (serial)
• Thus, 300 pickets take 30 + 300 + 30 = 360 minutes (serial time)
Computing Speedup
• What if the fence owner uses a spray gun to paint 300 pickets in one hour?
– Better serial algorithm
– If no spray guns are available for multiple workers, what is the maximum parallel speedup?

Number of painters   Time (minutes)         Speedup
1                    30 + 300 + 30 = 360    1.0X
2                    30 + 150 + 30 = 210    1.7X
10                   30 + 30 + 30 = 90      4.0X
100                  30 + 3 + 30 = 63       5.7X
Infinite             30 + 0 + 30 = 60       6.0X

• Illustrates Amdahl’s Law: potential speedup is restricted by serial code
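The picket-fence numbers can be written as a formula. The notation below is an assumption introduced for illustration: n is the number of painters, S(n) the speedup, and s the serial fraction of the one-painter time (60 of 360 minutes).

```latex
\[
S(n) = \frac{T_{\text{serial}}}{T_{\text{parallel}}(n)}
     = \frac{30 + 300 + 30}{30 + \frac{300}{n} + 30}
     = \frac{360}{60 + 300/n},
\qquad
\lim_{n \to \infty} S(n) = \frac{360}{60} = 6
\]
\[
\text{Equivalently, with } s = \tfrac{60}{360} = \tfrac{1}{6}: \qquad
S(n) = \frac{1}{s + \frac{1 - s}{n}}, \qquad
\lim_{n \to \infty} S(n) = \frac{1}{s} = 6
\]
```

However many painters are added, the 60 minutes of preparation and cleanup remain, so the speedup never exceeds 6.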
Efficiency
• Measure of how effectively computation resources are kept busy
– Speedup divided by number of processors
– Expressed as average percentage of non-idle time

Number of painters   Time (minutes)         Speedup   Efficiency
1                    30 + 300 + 30 = 360    1.0X      100%
2                    30 + 150 + 30 = 210    1.7X      85%
10                   30 + 30 + 30 = 90      4.0X      40%
100                  30 + 3 + 30 = 63       5.7X      5.7%
Infinite             30 + 0 + 30 = 60       6.0X      Very low
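The table can be reproduced with a few lines of C. This is a sketch of the picket-fence model only; the function names and the two constants are illustrative, and the model assumes the painting divides perfectly among the painters.

```c
#include <assert.h>

#define SERIAL_MINUTES   60.0    /* 30 min preparation + 30 min cleanup */
#define PARALLEL_MINUTES 300.0   /* 300 pickets at one minute each */

/* Total time in minutes when the painting is split among `painters`. */
double fence_time(int painters)
{
    assert(painters >= 1);
    return SERIAL_MINUTES + PARALLEL_MINUTES / painters;
}

/* Speedup: one-painter time divided by the time with `painters`. */
double fence_speedup(int painters)
{
    return fence_time(1) / fence_time(painters);
}

/* Efficiency: speedup divided by the number of painters (1.0 = 100%). */
double fence_efficiency(int painters)
{
    return fence_speedup(painters) / painters;
}
```

For 10 painters this gives 90 minutes, a speedup of 4.0 and an efficiency of 0.40. For 2 painters the unrounded efficiency is about 0.857; the table shows 85% because it divides the rounded speedup of 1.7 by 2.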
Granularity
• Loosely defined as the ratio of computation to synchronization
• Be sure there is enough work to merit parallel computation
• Example: Two farmers divide a field. How many more farmers can be added?
Load Balance
• Most effective distribution is to have equal amounts of work per processor; otherwise, some processors sit idle
• Example: Busing banquet tables
– Better to assign the same number of tables to each person
Parallelism in Intel® Processor Based Platforms
• ILP: Instruction Level Parallelism
– Pipelined execution
– Superscalar execution
• DLP: Data Level Parallelism
– SSE vector processing
– SIMD: Single Instruction, Multiple Data
• TLP: Thread Level Parallelism
– Hardware support for Hyper-Threading
– Multi-core architecture
– Cache-coherent multiple sockets
• CLP: Cluster Level Parallelism
– Multiple platforms connected via interconnection network
– No hardware-supported cache coherence
DLP – Data Level Parallelism in Intel® Processors
SIMD: Single Instruction, Multiple Data
• Scalar processing
– Traditional mode
– One operation produces one result: X + Y
• SIMD processing
– With SSE / SSE2
– One operation produces multiple results: (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0)
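The four-wide addition can be sketched with SSE intrinsics. The function name add4 is illustrative. The scalar fallback is an addition for compilers that do not define __SSE__, so the same source also builds on processors without SSE.

```c
#include <assert.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics */
#endif

/* Adds four pairs of floats: out[i] = x[i] + y[i] for i = 0..3. */
void add4(const float *x, const float *y, float *out)
{
    assert(x && y && out);
#if defined(__SSE__)
    __m128 vx = _mm_loadu_ps(x);             /* load x0..x3 (no alignment needed) */
    __m128 vy = _mm_loadu_ps(y);             /* load y0..y3 */
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));  /* one add produces four results */
#else
    for (int i = 0; i < 4; i++)              /* scalar mode: four separate adds */
        out[i] = x[i] + y[i];
#endif
}
```

In practice compilers can often vectorize a plain loop on their own; the intrinsics are used here only to make the single-instruction, multiple-data step visible.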
TLP – Thread Level Parallelism
Many Ways to Create Multiple Threads
[Figure: three levels that can supply multiple threads]
– Virtual container (VMM): multiple operating systems
– Operating system: multiple applications
– Application: multiple threads
Threading individual applications is the only way to achieve sufficient TLP in general!
Methods Overview
• Parallel programming methods
– Thread libraries: Win32 API, POSIX threads
– Compiler directives: OpenMP*
– Message passing: MPI
– Threaded runtime libraries: Intel® Math Kernel Library (MKL), Integrated Performance Primitives (IPP) and Threading Building Blocks (TBB)
• Choosing which API is right for you
Parallel APIs: OpenMP*
[Figure: collage of OpenMP constructs in C and Fortran, e.g. #pragma omp parallel for private(A, B), #pragma omp critical, C$OMP PARALLEL REDUCTION (+: A, B), C$OMP SECTIONS, C$OMP ATOMIC, !$OMP BARRIER, omp_set_lock(lck), CALL OMP_SET_NUM_THREADS(10), setenv OMP_SCHEDULE “dynamic”]
OpenMP: An API for Writing Multithreaded Applications
• A set of compiler directives and library routines for parallel application programmers
• Makes it easy to create multithreaded (MT) programs in Fortran, C and C++
• Standardizes the last 17 years of SMP practice
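A single directive is often enough to thread a loop. The sketch below estimates pi by numerical integration with an OpenMP reduction. The function name estimate_pi and the midpoint-rule formulation are illustrative choices, not taken from the slides.

```c
#include <assert.h>

/* Estimates pi as the integral of 4 / (1 + x^2) over [0, 1],
   using the midpoint rule with num_steps intervals. */
double estimate_pi(long num_steps)
{
    assert(num_steps > 0);
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    /* Each thread accumulates a private partial sum;
       the partial sums are added together when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;   /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

A compiler that is not asked to enable OpenMP (for example, GCC without -fopenmp) ignores the pragma and runs the loop serially, so the same source serves as both the serial and the threaded version.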
Parallel APIs: MPI, the Message Passing Interface
[Figure: collage of MPI names, e.g. MPI_Send, MPI_Recv, MPI_Ssend, MPI_Sendrecv, MPI_Bcast, MPI_Reduce, MPI_Scan, MPI_Allgatherv, MPI_Barrier, MPI_Waitall, MPI_Pack, MPI_Unpack, MPI_Type_contiguous, MPI_Group_size, MPI_COMM_WORLD]
MPI: An API for Writing Clustered Applications
• A library of routines to coordinate the execution of multiple processes
• Provides point-to-point and collective communication in Fortran, C and C++
• Unifies the last 20 years of cluster computing and MPP practice
Programming with MPI
• You need to set up the environment: define the context for a group of processes
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
– MPI_Comm_rank gives each process an ID ranging from 0 to (numprocs-1)
• MPI uses multiple processes; nothing is shared by default. Data is shared only by passing messages.
• Many MPI programs use only simple collective communication, for example a reduction:
MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
– &sum: take a local value
– &pi: combine into a single value
– MPI_SUM: using this op (e.g. SUM)
– 0: and send the answer to this process ID
How do most people use MPI? The Single Program, Multiple Data (SPMD) Pattern
SPMD pattern:
• Replicate the program.
• Add glue code (based on ID)
• Break up the data
[Figure: a sequential program working on a data set becomes a parallel program working on a decomposed data set, coordinated by passing messages]
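The "glue code based on ID" usually starts by working out which part of the data belongs to each process. The sketch below shows one common block decomposition in plain C, with no MPI calls, so it can be read and tried on its own. The block_range type and block_for_rank are hypothetical names; my_id and numprocs stand for the values an MPI program would obtain from MPI_Comm_rank and MPI_Comm_size.

```c
#include <assert.h>

typedef struct {
    long begin;   /* first index owned by this process */
    long end;     /* one past the last index owned */
} block_range;

/* Splits indices 0..n-1 into numprocs contiguous blocks whose sizes
   differ by at most one, and returns the block for process my_id. */
block_range block_for_rank(long n, int my_id, int numprocs)
{
    assert(n >= 0 && numprocs > 0 && my_id >= 0 && my_id < numprocs);
    long base  = n / numprocs;   /* every process gets at least this many */
    long extra = n % numprocs;   /* the first `extra` processes get one more */
    block_range r;
    r.begin = my_id * base + (my_id < extra ? my_id : extra);
    r.end   = r.begin + base + (my_id < extra ? 1 : 0);
    return r;
}
```

With 10 items and 4 processes the blocks are [0, 3), [3, 6), [6, 8) and [8, 10). Each process then runs the same loop over its own range and exchanges boundary or result data by passing messages.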
How to mix MPI and OpenMP* in one program?
SPMD pattern:
• Replicate the program.
• Add glue code (based on ID)
• Break up the data
[Figure: a sequential program working on a data set]
• Create the MPI program with its data decomposition.
• Use OpenMP inside each MPI process.
Summary
• There are at least four levels of parallelism: ILP, DLP, TLP and CLP
• Intel® Developer Products support developers in extracting the best possible performance from each level
• Currently there are multiple methods to introduce threading and parallelism
• To facilitate parallel programming in the future, models are needed that automatically assign compute load to all levels of parallelism
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.*Other names and brands may be claimed as the property of others.
Copyright © 2012. Intel Corporation.
http://intel.com/software/products