Lecture 5 4 Multi Core Programming Concept II
8/7/2019 Lecture 5 4 Multi Core Programming Concept II
UCCD3213 MULTICORE PROGRAMMING
Multicore Programming Concept II: Performance Concept
-
Introduction to Performance Concepts
Performance Concepts:
Simple Speedup
Computing Speedup
Efficiency
Granularity
Load Balance
-
Simple Speedup
Speedup measures the time required for a parallel program to execute versus
the time the best serial code requires to accomplish the same task.
Speedup = Serial Time / Parallel Time
According to Amdahl's Law, speedup is a function of the fraction of a
program that is parallel and of how much that fraction is accelerated.
Speedup = 1 / [S + (1 - S)/n + H(n)]
-
Computing Speedup Example
Example: Painting a Picket Fence
Painting a picket fence requires:
30 minutes of preparation (serial).
One minute to paint a single picket.
30 minutes to clean up (serial).
-
Computing Speedup
Consider how speedup is computed for different numbers of painters:
Number of Painters | Time                | Speedup
1                  | 30 + 300 + 30 = 360 | 1.0X
2                  | 30 + 150 + 30 = 210 | 1.7X
10                 | 30 + 30 + 30 = 90   | 4.0X
100                | 30 + 3 + 30 = 63    | 5.7X
Infinite           | 30 + 0 + 30 = 60    | 6.0X
-
Parallel Efficiency
Parallel Efficiency:
Is a measure of how efficiently processor resources are used during
parallel computations.
Is equal to (Speedup / Number of Threads) * 100%.
Consider how efficiency is computed with different numbers of painters:
Number of Painters | Time                | Speedup | Efficiency
1                  | 30 + 300 + 30 = 360 | 1.0X    | 100%
2                  | 30 + 150 + 30 = 210 | 1.7X    | 85%
10                 | 30 + 30 + 30 = 90   | 4.0X    | 40%
100                | 30 + 3 + 30 = 63    | 5.7X    | 5.7%
Infinite           | 30 + 0 + 30 = 60    | 6.0X    | approaches 0%
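The speedup and efficiency tables for the fence example follow directly from the timings; a minimal sketch that reproduces them (to rounding):

```python
# Picket-fence example: 30 min of serial prep, 300 min of painting that
# splits evenly among the painters, and 30 min of serial cleanup.
def fence_time(painters):
    return 30 + 300 / painters + 30

def speedup(painters):
    return fence_time(1) / fence_time(painters)

def efficiency_pct(painters):
    return speedup(painters) / painters * 100

for n in (1, 2, 10, 100):
    print(f"{n:>3} painters: {fence_time(n):5.0f} min, "
          f"{speedup(n):.1f}X, {efficiency_pct(n):.1f}%")
```

As the painter count grows, the two serial 30-minute phases dominate, so speedup saturates at 360/60 = 6X while efficiency collapses.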
-
Granularity
Definition:
An approximation of the ratio of computation to synchronization.
The two types of granularity are:
Coarse-grained: Concurrent calculations that have a large amount of computation between synchronization operations are known as coarse-grained.
Fine-grained: Cases where there is very little computation between synchronization events are known as fine-grained.
Example: Field and Farmers
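The difference can be sketched with a toy parallel sum, using a lock acquisition to stand in for a synchronization event (a hypothetical illustration, not from the lecture):

```python
import threading

total = 0
lock = threading.Lock()
data = list(range(100_000))

def fine_grained(chunk):
    # Fine-grained: synchronize once per element, so there is very
    # little computation between synchronization events.
    global total
    for x in chunk:
        with lock:
            total += x

def coarse_grained(chunk):
    # Coarse-grained: compute a local sum and synchronize once per
    # chunk, giving a large amount of computation per synchronization.
    global total
    local = sum(chunk)
    with lock:
        total += local

# Split the work between two threads using the coarse-grained strategy.
chunks = [data[:50_000], data[50_000:]]
threads = [threading.Thread(target=coarse_grained, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)
```

Both versions produce the same sum; the coarse-grained one simply pays the lock cost twice instead of 100,000 times.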
-
Load Balance
Load balancing refers to the distribution of work across multiple threads so
that they all perform roughly the same amount of work.
The most effective distribution is such that:
Threads perform equal amounts of work.
Threads finish their work at close to the same time.
With an unequal distribution, threads that finish first sit idle while the rest complete their work.
Example: Cleaning Banquet Tables
-
COMPUTING SPEEDUP
What speedup should I expect?
Amdahl's Law
Gustafson's Law
Work and Span Law
-
AMDAHL'S LAW
Amdahl started with the clear statement that program speedup is a function of the fraction of a program that is accelerated and of how much that fraction is accelerated.
-
AMDAHL'S LAW
So, if you could speed up half the program by 15 percent, you'd get:
Speedup = 1 / ((1 - 0.50) + (0.50 / 1.15)) = 1 / (0.50 + 0.43) = 1.08
This result is a speed increase of 8 percent, which is what you'd expect: if half of the program is improved 15 percent, then the whole program is improved by half that amount.
-
AMDAHL'S LAW
In this equation, Speedup = 1 / [S + (1 - S)/n + H(n)], S is the time spent executing the serial portion of the parallelized version and n is the number of processor cores.
-
AMDAHL'S LAW
A parallel application cannot run faster than the sum of its sequential parts!
-
AMDAHL'S LAW
Assume a sequential program, with serial execution time T.
[Figures: the following slides developed Amdahl's Law graphically from this starting point; only the slide titles survive in this transcript.]
WORK & SPAN LAW: AMDAHL'S LAW
Gene M. Amdahl
If 50% of your application is parallel and 50% is serial, you can't get more than a factor of 2 speedup, no matter how many processors it runs on.*
*In general, if a fraction α of an application can be run in parallel and the rest must run serially, the speedup is at most 1/(1 - α).
But whose application can be decomposed into just a serial part and a parallel part? For my application, what speedup should I expect?
-
WORK & SPAN LAW: MEASUREMENTS MATTER
Q: What does the performance of a program on 1 and 2 cores tell you about its expected performance on 16 or 64 cores?
A: Almost nothing.
Many parallel programs can't exploit more than a few cores.
To predict the scalability of a program to many cores, you need to know the amount of parallelism exposed by the code.
Parallelism is not a gut-feel metric, but a computable and measurable quantity.
-
int fib(int n) {
    if (n < 2) return n;
    int x = cilk_spawn fib(n - 1);
    int y = fib(n - 2);
    cilk_sync;
    return x + y;
}
-
WORK & SPAN LAW: COMPUTATION DAG
A parallel instruction stream is a dag G = (V, E).
Each vertex v ∈ V is a strand: a sequence of instructions not containing a call, spawn, sync, or return (or thrown exception).
An edge e ∈ E is a spawn, call, return, or continue edge.
Loop parallelism (cilk_for) is converted to spawns and syncs using recursive divide-and-conquer.
[Figure: dag of the fib computation, with the initial strand, final strand, and spawn, call, return, and continue edges labeled.]
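Work and span (defined as T1 and T∞ on the next slides) can be computed mechanically from such a dag. A minimal sketch over a made-up dag with unit-time strands:

```python
def work_and_span(dag):
    # dag: {strand: [successor strands]}, assuming unit cost per strand.
    # Work = total number of strands; span = longest path (critical path).
    work = len(dag)
    memo = {}
    def longest_from(v):
        # Length of the longest path starting at strand v.
        if v not in memo:
            memo[v] = 1 + max((longest_from(s) for s in dag[v]), default=0)
        return memo[v]
    span = max(longest_from(v) for v in dag)
    return work, span

# A toy series-parallel dag: a, then b and c in parallel, then d.
toy = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(work_and_span(toy))  # (4, 3): work 4, span a-b-d (or a-c-d)
```

The same traversal over the fib(4) dag on the later slide would yield its quoted work of 17 and span of 8.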
-
WORK & SPAN LAW: PERFORMANCE MEASURES
TP = execution time on P processors
T1 = work
T∞ = span*
*Also called critical-path length or computational depth.
WORK LAW: TP ≥ T1/P
SPAN LAW: TP ≥ T∞
-
WORK & SPAN LAW: SERIES COMPOSITION
For A followed by B:
Work: T1(A∪B) = T1(A) + T1(B)
Span: T∞(A∪B) = T∞(A) + T∞(B)
-
WORK & SPAN LAW: PARALLEL COMPOSITION
For A and B in parallel:
Work: T1(A∪B) = T1(A) + T1(B)
Span: T∞(A∪B) = max{T∞(A), T∞(B)}
-
WORK & SPAN LAW: SPEEDUP
Def. T1/TP = speedup on P processors.
If T1/TP = Θ(P), we have linear speedup;
if T1/TP = P, we have perfect linear speedup;
if T1/TP > P, we have superlinear speedup,
which is not possible in this performance model, because of the Work Law TP ≥ T1/P.
-
WORK & SPAN LAW: PARALLELISM
Because the Span Law dictates that TP ≥ T∞, the maximum possible speedup given T1 and T∞ is
T1/T∞ = parallelism
= the average amount of work per step along the span.
-
WORK & SPAN LAW: EXAMPLE: FIB(4)
Assume for simplicity that each strand in fib(4) takes unit time to execute.
Work: T1 = 17
Span: T∞ = 8
Parallelism: T1/T∞ = 2.125
Using many more than 2 processors can yield only marginal performance gains.
[Figure: computation dag of fib(4), with the eight strands along the critical path numbered 1 through 8.]
-
WORK & SPAN LAW: DEVELOPING SOCRATES
For the competition, Socrates was to run on a 512-processor Connection Machine Model CM5 supercomputer at the University of Illinois.
The developers had easy access to a similar 32-processor CM5 at MIT.
One of the developers proposed a change to the program that produced a speedup of over 20% on the MIT machine.
After a back-of-the-envelope calculation, the proposed improvement was rejected!
-
WORK & SPAN LAW: SOCRATES PARADOX
TP ≈ T1/P + T∞
Original program: T1 = 2048 seconds, T∞ = 1 second
T32 = 2048/32 + 1 = 65 seconds
T512 = 2048/512 + 1 = 5 seconds
Proposed program: T1 = 1024 seconds, T∞ = 8 seconds
T32 = 1024/32 + 8 = 40 seconds
T512 = 1024/512 + 8 = 10 seconds
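The back-of-the-envelope calculation can be redone with the approximation TP ≈ T1/P + T∞ and the slide's own numbers:

```python
def approx_time(work, span, p):
    # Greedy-scheduling approximation: T_P ~ T_1/P + T_inf.
    return work / p + span

# Original program: T1 = 2048 s, span 1 s.
# Proposed program: T1 = 1024 s, span 8 s.
for name, (t1, t_inf) in {"original": (2048, 1), "proposed": (1024, 8)}.items():
    print(name, approx_time(t1, t_inf, 32), approx_time(t1, t_inf, 512))
# On 32 processors the proposal wins (40 s vs 65 s); on the
# 512-processor competition machine it loses (10 s vs 5 s).
```

The proposal halved the work but multiplied the span by eight, and at 512 processors the span term dominates.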
-
WORK & SPAN LAW: MORAL OF THE STORY
Work and span predict performance better than running times alone can.
-
GUSTAFSON'S LAW
Amdahl's Law indicates that the speedup from
parallelizing any computing problem is
inherently limited by the presence of serial (non-
parallelizable) portions.
Gustafson argues that, as processor power
increases, the size of the problem set also tends
to increase.
-
GUSTAFSON'S LAW
To cite one obvious example: as mainstream
computational resources have increased,
computer games have become far more
sophisticated, both in terms of user-interface
characteristics and in terms of the underlying
physics and other logic.
-
GUSTAFSON'S LAW
Because Amdahl's Law cannot address this
relationship, Gustafson modifies Amdahl's work
according to the precept (based on experimental
findings at Sandia) that the overall problem size
should increase proportionally to the number of
processor cores (N), while the size of the serial
portion of the problem should remain constant as
N increases.
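Under that precept, the scaled speedup takes the form usually quoted for Gustafson's Law (stated here from the general literature, since the transcript omits the formula slides; s denotes the serial fraction of the run on the parallel machine):

```latex
% Gustafson's scaled speedup on N cores, serial fraction s:
\mathrm{Speedup}(N) \;=\; s + N\,(1 - s) \;=\; N - (N - 1)\,s
```

Unlike Amdahl's bound of 1/s, this grows without limit in N, because the parallel portion of the workload is assumed to grow with the machine.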
-
GUSTAFSON'S LAW
[Figures: the scaled-speedup calculations, including the Table 2 referenced below, appeared here graphically and are not preserved in this transcript.]
GUSTAFSON'S LAW
Clearly, these calculations show that the performance result continues to scale upward as more processor cores are applied to the computational load.
It is also worth noting that the per-core efficiency trends downward as additional cores are added, although the data in Table 2 show that the decrease in per-core efficiency between the two-core case and the four-core case is greater than the entire decrease between four cores and 1024 cores.
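Both trends can be checked numerically; a sketch assuming, purely for illustration, a serial fraction of 5% (the actual inputs behind Table 2 are not preserved here):

```python
def gustafson_speedup(n, serial_fraction):
    # Scaled speedup: the serial part stays fixed while the parallel
    # part grows in proportion to the number of cores N.
    s = serial_fraction
    return n - (n - 1) * s

for n in (2, 4, 1024):
    sp = gustafson_speedup(n, 0.05)
    print(n, round(sp, 2), round(sp / n * 100, 2))
# With s = 0.05, per-core efficiency falls 97.5% -> 96.25% -> ~95.0%:
# the 2-to-4-core drop already exceeds the entire 4-to-1024-core drop.
```

Speedup keeps climbing with N even as each added core contributes slightly less, matching the pattern described above.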
-
REFERENCES
Intel Software Network:
http://software.intel.com/en-us/articles/amdahls-law-gustafsons-trend-and-the-performance-limits-of-parallel-applications/
Work and Span Laws:
http://www.cprogramming.com/parallelism.html