Lecture #2: Architecture, Theory & Patterns | February 1st, 2011
Nicolas Pinto (MIT, Harvard) [email protected]
Massively Parallel Computing (CS 264 / CSCI E-292)
Objectives
• introduce important computational thinking skills for massively parallel computing
• understand hardware limitations
• understand algorithm constraints
• identify common patterns
During this course, we'll try to ... and use existing material ;-)
“adapted for CS264”
Outline
• Thinking Parallel
• Architecture
• Programming Model
• Bits of Theory
• Patterns
Parallel Computing

“Parallel computing is a form of computing in which many instructions are carried out simultaneously.” (Wikipedia)

• Traditionally: large, expensive, specialized
  • Exotic supercomputers (e.g. Cray)
  • Distributed systems (e.g. ASCI White, BlueGene)
• Parallel computing was traditionally inaccessible to the commodity marketplace

“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year ... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.”
(Gordon Moore, Electronics Magazine, 19 April 1965)

• The most economic number of components in an IC will double every year
• Historically, CPUs get faster
  • Hardware reaching frequency limitations
• Now CPUs get wider
  • Rather than expecting CPUs to get twice as fast, expect to have twice as many!
• Parallel processing for the masses
• Unfortunately: parallel programming is hard!
  • Algorithms and data structures must be fundamentally redesigned
slide by Matthew Bolitho
Motivation
Thinking Parallel
Getting your feet wet
• Common scenario: “I want to make the algorithm X run faster, help me!”
• Q: How do you approach the problem?
How?
How?
• Option 1: wait
• Option 2: gcc -O3 -msse4.2
• Option 3: xlc -O5
• Option 4: use parallel libraries (e.g. (cu)blas)
• Option 5: hand-optimize everything!
• Option 6: wait more
What else ?
How about analysis ?
Getting your feet wet
[Bar chart, time (s) per function: load_data() 50, foo() 11, bar() 10, yey() 29; total ≈ 100 s.]
Algorithm X v1.0 Profiling Analysis on Input 10x10x10
sequential in nature
100% parallelizable
Q: What is the maximum speed up ?
A: 2X ! :-(
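The 2X bound is just Amdahl's law. A minimal sketch of the arithmetic in C (assuming the profile above: 50 s sequential, 50 s perfectly parallelizable across N processors):

/* Amdahl's law: speedup = 1 / ((1 - p) + p / N),
   where p is the parallelizable fraction of the runtime.
   Numbers below are taken from the profiling chart above. */
#include <stdio.h>

int main(void)
{
    double t_seq = 50.0, t_par = 50.0;       /* seconds, from the profile */
    double p = t_par / (t_seq + t_par);      /* parallelizable fraction   */

    for (int n = 1; n <= 1024; n *= 4)
        printf("N = %4d  speedup = %.2fx\n", n, 1.0 / ((1.0 - p) + p / n));
    /* as N grows, speedup approaches 1 / (1 - p) = 2x */
    return 0;
}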
Getting your feet wet
[Bar chart, time (s) per function for load_data(), foo(), bar(), yey(): one function now dominates at 9,000 s while the other three take roughly 250-350 s each.]
Algorithm X v1.0 Profiling Analysis on Input 100x100x100
sequential in nature
100% parallelizable
Q: and now?
You need to...
• ... understand the problem (duh!)
• ... study the current (sequential?) solutions and their constraints
• ... know the input domain
• ... profile accordingly
• ... “refactor” based on new constraints (hw/sw)
A better way ?
Speculation: (input) domain-aware optimization using some sort of probabilistic modeling ?
...
doesn’t scale !
Some Perspective
The “problem tree” for scientific problem solving: there are many options to try to achieve the same goal.
[Figure: a technical problem to be analyzed can be approached through consultation with experts, experiments, or theoretical analysis, leading to scientific model "A" or "B", discretization "A" or "B", a direct-elimination or an iterative equation solver, and finally a sequential or a parallel implementation.]
from Scott et al. “Scientific Parallel Computing” (2005)
Computational Thinking
• translate/formulate domain problems into computational models that can be solved efficiently by available computing resources
• requires a deep understanding of their relationships
adapted from Hwu & Kirk (PASI 2011)
[Figure: Architecture, Algorithms, and Languages surrounding Parallel Computing, which supports APPLICATIONS. Knowledge of algorithms, architecture, and languages contributes to effective use of parallel computers in practical applications.]
adapted from Scott et al. “Scientific Parallel Computing” (2005)
Getting ready...
Compilers, Patterns
Programming Models
Parallel Thinking
Fundamental Skills
• Computer architecture
• Programming models and compilers
• Algorithm techniques and patterns
• Domain knowledge
Computer Architecture
• memory organization, bandwidth and latency; caching and locality (memory hierarchy)
• floating-point precision vs. accuracy
• SISD, SIMD, MISD, MIMD vs. SIMT, SPMD
critical for understanding tradeoffs between algorithms
Programming models
• parallel execution models (threading hierarchy)
• optimal memory access patterns
• array data layout and loop transformations
for optimal data structure and code execution
Algorithms and patterns
• toolbox for designing good parallel algorithms
• it is critical to understand their scalability and efficiency
• many have been exposed and documented
• sometimes hard to “extract”
• ... but keep trying!
Domain Knowledge
• abstract modeling
• mathematical properties
• accuracy requirements
• coming back to the drawing board to expose more/better parallelism ?
You can do it!
• thinking parallel is not as hard as you may think
• many techniques have been thoroughly explained...
• ... and are now “accessible” to non-experts !
Architecture
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
What’s in a computer?
Processor
Intel Q6600 Core2 Quad, 2.4 GHz
Die
(2×) 143 mm2, 2× 2 cores
582,000,000 transistors
∼ 100W
Memory
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
A Basic Processor
[Diagram: an internal bus connects the register file, flags, data ALU, address ALU, control unit (with PC), and memory interface (instruction fetch); the memory interface drives the external data bus and address bus.]
(loosely based on Intel 8086)
Bonus question: What's a bus?
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
How all of this fits together
Everything synchronizes to the Clock.
Control Unit (“CU”): the brains of the operation. Everything connects to it.
Bus entries/exits are gated and (potentially) buffered.
The CU controls the gates and tells the other units ‘what’ and ‘how’:
• What operation?
• Which register?
• Which addressing mode?
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
What is. . . an ALU?
Arithmetic Logic Unit
One or two operands A, B
Operation selector (Op):
• (Integer) Addition, Subtraction
• (Logical) And, Or, Not
• (Bitwise) Shifts (equivalent to
multiplication by power of two)
• (Integer) Multiplication, Division
Specialized ALUs:
• Floating Point Unit (FPU)
• Address ALU
Operates on binary representations of
numbers. Negative numbers represented
by two’s complement.
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
What is. . . a Register File?
Registers are On-Chip Memory
• Directly usable as operands in
Machine Language
• Often “general-purpose”
• Sometimes special-purpose: Floating
point, Indexing, Accumulator
• Small: x86-64 has 16 × 64-bit GPRs
• Very fast (near-zero latency)
%r0
%r1
%r2
%r3
%r4
%r5
%r6
%r7
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
How does computer memory work?
One (reading) memory transaction (simplified):
Processor Memory
CLK
R/W
A0..15
D0..15
Observation: Access (and addressing) happens
in bus-width-size “chunks”.
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
What is. . . a Memory Interface?
Memory Interface gets and stores binary words in off-chip memory.
Smallest granularity: Bus width
Tells outside memory
• “where” through address bus
• “what” through data bus
Computer main memory is “Dynamic RAM” (DRAM): slow, but big and cheap.
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
A Very Simple Program
int a = 5;
int b = 17;
int z = a * b;
 4: c7 45 f4 05 00 00 00    movl $0x5,-0xc(%rbp)
 b: c7 45 f8 11 00 00 00    movl $0x11,-0x8(%rbp)
12: 8b 45 f4                mov  -0xc(%rbp),%eax
15: 0f af 45 f8             imul -0x8(%rbp),%eax
19: 89 45 fc                mov  %eax,-0x4(%rbp)
1c: 8b 45 fc                mov  -0x4(%rbp),%eax
Things to know:
• Addressing modes (Immediate, Register, Base plus Offset)
• Hexadecimal notation (0x...)
• “AT&T form” (we'll use this): <opcode><size> <source>, <dest>
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
A Very Simple Program: Intel Form
4: c7 45 f4 05 00 00 00 mov DWORD PTR [rbp−0xc],0x5
b: c7 45 f8 11 00 00 00 mov DWORD PTR [rbp−0x8],0x11
12: 8b 45 f4 mov eax,DWORD PTR [rbp−0xc]
15: 0f af 45 f8 imul eax,DWORD PTR [rbp−0x8]
19: 89 45 fc mov DWORD PTR [rbp−0x4],eax
1c: 8b 45 fc mov eax,DWORD PTR [rbp−0x4]
• “Intel form” (you might see this on the net): <opcode> <sized dest>, <sized source>
• Goal: Reading comprehension.
• Don’t understand an opcode?Google “<opcode> intel instruction”.
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Machine Language Loops
int main()
{
  int y = 0, i;
  for (i = 0; y < 10; ++i)
    y += i;
  return y;
}
 0: 55                      push %rbp
 1: 48 89 e5                mov  %rsp,%rbp
 4: c7 45 f8 00 00 00 00    movl $0x0,-0x8(%rbp)
 b: c7 45 fc 00 00 00 00    movl $0x0,-0x4(%rbp)
12: eb 0a                   jmp  1e <main+0x1e>
14: 8b 45 fc                mov  -0x4(%rbp),%eax
17: 01 45 f8                add  %eax,-0x8(%rbp)
1a: 83 45 fc 01             addl $0x1,-0x4(%rbp)
1e: 83 7d f8 09             cmpl $0x9,-0x8(%rbp)
22: 7e f0                   jle  14 <main+0x14>
24: 8b 45 f8                mov  -0x8(%rbp),%eax
27: c9                      leaveq
28: c3                      retq
Things to know:
• Condition Codes (Flags): Zero, Sign, Carry, etc.
• Call Stack: Stack frame, stack pointer, base pointer
• ABI: Calling conventions
Want to make those yourself? Write myprogram.c.
$ cc -c myprogram.c
$ objdump --disassemble myprogram.o
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
We know how a computer works!
All of this can be built in about 4000 transistors.
(e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600)
So what exactly is Intel doing with the other 581,996,000
transistors?
Answer:
Make things go faster!
Goal now: Understand sources of slowness, and how they get addressed.
Remember: High Performance Computing
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
The High-Performance Mindset
Writing high-performance codes
Mindset: What is going to be the limiting factor?
• ALU?
• Memory?
• Communication? (if multi-machine)
Benchmark the assumed limiting factor right away.
Evaluate
• Know your peak throughputs (roughly)
• Are you getting close?
• Are you tracking the right limiting factor?
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
Source of Slowness: Memory
Memory is slow.
Distinguish two different versions of “slow”:
• Bandwidth
• Latency
→ Memory has long latency, but can have large bandwidth.
Size of die vs. distance to memory: big!
Dynamic RAM: long intrinsic latency!
Idea:
Put a look-up table of recently-used data onto the chip.
→ “Cache”
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
The Memory Hierarchy
Hierarchy of increasingly bigger, slower memories:
Registers: 1 kB, 1 cycle
L1 Cache: 10 kB, 10 cycles
L2 Cache: 1 MB, 100 cycles
DRAM: 1 GB, 1000 cycles
Virtual Memory (hard drive): 1 TB, 1 M cycles
How might data locality factor into this?
What is a working set?
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
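As a concrete illustration of why data locality matters in this hierarchy (a minimal sketch; the matrix size is an arbitrary assumption), the two loops below touch exactly the same data, but the first walks memory contiguously while the second strides by a whole row per access:

/* Row-major vs. column-major traversal of the same matrix.
   The contiguous traversal reuses each cache line fully;
   the strided one touches a new line on almost every access. */
#include <stdio.h>

#define N 4096                       /* 4096 x 4096 doubles = 128 MiB */

int main(void)
{
    static double a[N][N];
    double sum = 0.0;

    for (int i = 0; i < N; ++i)      /* row-major: contiguous accesses */
        for (int j = 0; j < N; ++j)
            sum += a[i][j];

    for (int j = 0; j < N; ++j)      /* column-major: stride of N doubles */
        for (int i = 0; i < N; ++i)
            sum += a[i][j];

    printf("%f\n", sum);             /* keep the loops from being optimized away */
    return 0;
}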
Impact on Performance
[Figure: hypothetical model of performance of a computer having a hierarchy of memory systems (registers, cache, main memory, and disk). Performance steps down as the problem grows from fitting within registers, to fitting within cache, to fitting within main memory, to requiring secondary (disk) memory, until the problem is too big for the system.]
from Scott et al. “Scientific Parallel Computing” (2005)
Cache: Actual Implementation
Demands on cache implementation:
• Fast, small, cheap, low-power
• Fine-grained
• High “hit”-rate (few “misses”)
Problem: Goals at odds with each other: access matching logic is expensive!
Solution 1: More data per unit of access matching logic → larger “Cache Lines”
Solution 2: Simpler/less access matching logic → less than full “Associativity”
Other choices: Eviction strategy, size
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Cache: Associativity
Direct Mapped vs. 2-way set associative
[Diagram: in a direct-mapped cache, each memory block maps to exactly one cache line; in a 2-way set-associative cache, each memory block may be placed in either of the two lines of its set.]
[Chart: miss rate versus cache size on the integer portion of SPEC CPU2000 (Cantin, Hill 2003).]
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
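To make the address-to-set mapping concrete, here is a small sketch with hypothetical parameters (64-byte lines and 64 sets, roughly matching the L1 data cache on the next slide): it splits an address into offset, set index, and tag, which is what the access-matching logic compares.

/* Minimal sketch (assumed parameters, made-up address):
   where does an address land in a set-associative cache? */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64u    /* bytes per cache line (assumption) */
#define NUM_SETS  64u    /* number of sets (assumption)       */

int main(void)
{
    uintptr_t addr = 0x7ffe12345678u;                 /* made-up address */

    unsigned  offset = addr % LINE_SIZE;              /* byte within the line  */
    unsigned  set    = (addr / LINE_SIZE) % NUM_SETS; /* which set it maps to  */
    uintptr_t tag    = addr / (LINE_SIZE * NUM_SETS); /* identifies the line   */

    /* Addresses with the same set index but different tags compete
       for the ways of that set. */
    printf("offset=%u set=%u tag=%#lx\n", offset, set, (unsigned long) tag);
    return 0;
}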
Cache Example: Intel Q6600/Core2 Quad
--- L1 data cache ---fully associative cache = falsethreads sharing this cache = 0x0 (0)processor cores on this die= 0x3 (3)system coherency line size = 0x3f (63)ways of associativity = 0x7 (7)number of sets - 1 (s) = 63
--- L1 instruction ---fully associative cache = falsethreads sharing this cache = 0x0 (0)processor cores on this die= 0x3 (3)system coherency line size = 0x3f (63)ways of associativity = 0x7 (7)number of sets - 1 (s) = 63
--- L2 unified cache ---fully associative cache falsethreads sharing this cache = 0x1 (1)processor cores on this die= 0x3 (3)system coherency line size = 0x3f (63)ways of associativity = 0xf (15)number of sets - 1 (s) = 4095
More than you care to know about your CPU:http://www.etallen.com/cpuid.html
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Measuring the Cache I
void go(unsigned count, unsigned stride)
{
  const unsigned arr_size = 64 * 1024 * 1024;
  int *ary = (int *) malloc(sizeof(int) * arr_size);

  for (unsigned it = 0; it < count; ++it)
  {
    for (unsigned i = 0; i < arr_size; i += stride)
      ary[i] *= 17;
  }

  free(ary);
}
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
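A minimal driver for the routine above, as a sketch: it assumes a POSIX clock_gettime() and that go() from the slide is linked in, and simply times a sweep over a few strides so the cache-line effect becomes visible.

/* Sketch of a timing harness for go(); link against the code above. */
#include <stdio.h>
#include <time.h>

void go(unsigned count, unsigned stride);     /* from the slide above */

int main(void)
{
    for (unsigned stride = 1; stride <= 64; stride *= 2)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        go(1, stride);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("stride %2u: %.3f s\n", stride, secs);
    }
    return 0;
}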
Measuring the Cache II
void go(unsigned array_size, unsigned steps)
{
  int *ary = (int *) malloc(sizeof(int) * array_size);
  unsigned asm1 = array_size - 1;

  for (unsigned i = 0; i < steps; ++i)
    ary[(i * 16) & asm1]++;

  free(ary);
}
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Measuring the Cache III
void go(unsigned array_size, unsigned stride, unsigned steps)
{
  char *ary = (char *) malloc(sizeof(int) * array_size);

  unsigned p = 0;
  for (unsigned i = 0; i < steps; ++i)
  {
    ary[p]++;
    p += stride;
    if (p >= array_size)
      p = 0;
  }

  free(ary);
}
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Mike Bauer (Stanford)
Tue 4/5/11: Guest Lecture by Mike Bauer (Stanford)
http://sequoia.stanford.edu/
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
Source of Slowness: Sequential Operation
IF Instruction fetch
ID Instruction Decode
EX Execution
MEM Memory Read/Write
WB Result Writeback
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Solution: Pipelining
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Pipelining
(MIPS, 110,000 transistors)
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Issues with Pipelines
Pipelines generally help performance, but not always.
Possible issues:
• Stalls
• Dependent Instructions
• Branches (+Prediction)
• Self-Modifying Code
“Solution”: Bubbling, extra circuitry
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Intel Q6600 Pipeline
New concept: Instruction-level parallelism (“Superscalar”)
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Programming for the Pipeline
How to upset a processor pipeline:
for (int i = 0; i < 1000; ++i)
  for (int j = 0; j < 1000; ++j)
  {
    if (j % 2 == 0)
      do_something(i, j);
  }
. . . why is this bad?
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
A Puzzle
int steps = 256 * 1024 * 1024;
int a[2];

// Loop 1
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }
Which is faster?
. . . and why?
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Two useful Strategies
Loop unrolling:
for (int i = 0; i < 1000; ++i)
  do_something(i);
→
for (int i = 0; i < 1000; i += 2)
{
  do_something(i);
  do_something(i + 1);
}

Software pipelining:

for (int i = 0; i < 1000; ++i)
{
  do_a(i);
  do_b(i);
}
→
for (int i = 0; i < 1000; i += 2)
{
  do_a(i);
  do_a(i + 1);
  do_b(i);
  do_b(i + 1);
}
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
SIMD
Control Units are large and expensive.
Functional Units are simple and cheap.
→ Increase the Function/Control ratio:
Control several functional units with
one control unit.
All execute same operation.
[Diagram (SIMD): a single instruction pool drives multiple processing units, each operating on its own element from the data pool.]
GCC vector extensions:
typedef int v4si __attribute__ ((vector_size (16)));
v4si a, b, c;
c = a + b;
// +, -, *, /, unary minus, ^, |, &, ~, %
Will revisit for OpenCL, GPUs.
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
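As a small, self-contained illustration of the vector extension above (a sketch assuming GCC or Clang; element indexing of vector types needs a reasonably recent compiler), the following program adds two 4-wide integer vectors and prints the result:

/* One SIMD add replaces an explicit loop over four ints. */
#include <stdio.h>

typedef int v4si __attribute__ ((vector_size (16)));   /* 4 x 32-bit int */

int main(void)
{
    v4si a = {1, 2, 3, 4};
    v4si b = {10, 20, 30, 40};
    v4si c = a + b;                      /* element-wise addition */

    for (int i = 0; i < 4; ++i)
        printf("%d ", c[i]);             /* per-element access (GCC >= 4.7 / Clang) */
    printf("\n");
    return 0;
}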
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
• FLOPS performance!
(Figure courtesy of Dave Luebke (NVIDIA), CUDA course)
• Designed for math-intensive, parallel problems
• More transistors dedicated to ALU than to flow control and data cache
• What are the consequences?
  • Program must be more predictable:
    • Data access coherency
    • Program flow
GPUs ?
Intro PyOpenCL What and Why? OpenCL
“CPU-style” Cores
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
CPU-“style” cores
ALU (Execute)
Fetch/ Decode
Execution Context
Out-of-order control logic
Fancy branch predictor
Memory pre-fetcher
Data cache (A big one)
13
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDA
Intro PyOpenCL What and Why? OpenCL
Slimming down
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Slimming down
ALU (Execute)
Fetch/ Decode
Execution Context
Idea #1:
Remove components that help a single instruction stream run fast
14
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
More Space: Double the Number of Cores
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Two cores (two fragments in parallel)
ALU (Execute)
Fetch/ Decode
Execution Context
ALU (Execute)
Fetch/ Decode
Execution Context
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)

fragment 1 and fragment 2 each run this same instruction stream, one per core.
15
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
. . . again
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Four cores (four fragments in parallel)
ALU (Execute)
Fetch/ Decode
Execution Context
ALU (Execute)
Fetch/ Decode
Execution Context
ALU (Execute)
Fetch/ Decode
Execution Context
ALU (Execute)
Fetch/ Decode
Execution Context
16
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
. . . and again
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Sixteen cores (sixteen fragments in parallel)
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
16 cores = 16 simultaneous instruction streams 17 Credit: Kayvon Fatahalian (Stanford)
→ 16 independent instruction streams
Reality: instruction streams are not actually very different/independent
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
. . . and again
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Sixteen cores (sixteen fragments in parallel)
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
16 cores = 16 simultaneous instruction streams 17 Credit: Kayvon Fatahalian (Stanford)
→ 16 independent instruction streams
Reality: instruction streams not actuallyvery different/independent
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
Saving Yet More Space
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Recall: simple processing core
Fetch/ Decode
ALU (Execute)
Execution Context
Idea #2
Amortize cost/complexity of managing an instruction stream across many ALUs
→ SIMD
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
Saving Yet More Space
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Add ALUs
Fetch/ Decode
Idea #2:
Amortize cost/complexity of managing an instruction stream across many ALUs
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
SIMD processing Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
Gratuitous Amounts of Parallelism!
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
128 fragments in parallel
= 16 simultaneous instruction streams 16 cores = 128 ALUs
24 Credit: Kayvon Fatahalian (Stanford)
Example:
128 instruction streams in parallel16 independent groups of 8 synchronized streams
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
http://www.youtube.com/watch?v=1yH_j8-VVLo
Intro PyOpenCL What and Why? OpenCL
Remaining Problem: Slow Memory
Problem
Memory still has very high latency ...
... but we've removed most of the hardware that helps us deal with that.
We’ve removed
caches
branch prediction
out-of-order execution
So what now?
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
[Diagram (hiding shader stalls): the core keeps four groups of fragments in flight (Frag 1-8, 9-16, 17-24, 25-32); when one group stalls on memory, the Fetch/Decode unit switches to another group so the ALUs stay busy.]
Idea #3
Even more parallelism+ Some extra memory= A solution!
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
GPU Architecture Summary
Core Ideas:
1. Many slimmed-down cores → lots of parallelism
2. More ALUs, fewer control units
3. Avoid memory stalls by interleaving execution of SIMD groups (“warps”)
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
• FLOPS performance!
(Figure courtesy of Dave Luebke (NVIDIA), CUDA course)
• Designed for math-intensive, parallel problems
• More transistors dedicated to ALU than to flow control and data cache
• What are the consequences?
  • Program must be more predictable:
    • Data access coherency
    • Program flow
slide by Matthew Bolitho
Is it free?
Some terminology
One way to classify machines distinguishes between:
• shared memory: global memory can be accessed by all processors or cores. Information is exchanged between threads using shared variables written by one thread and read by another. Access to shared variables must be coordinated.
• distributed memory: private memory for each processor, only accessible by that processor, so no synchronization of memory accesses is needed. Information is exchanged by sending data from one processor to another via an interconnection network using explicit communication operations.
[Diagram: “distributed memory” shows processors P, each with its own memory M, joined by an interconnection network; “shared memory” shows processors P joined by an interconnection network to common memory modules M.]
Hybrid approach increasingly common (now: mostly hybrid)
Programming Model (Overview)
GPU Architecture
CUDA Programming Model
Intro PyOpenCL What and Why? OpenCL
Connection: Hardware ↔ Programming Model
[Hardware: ten slimmed-down cores, each with a Fetch/Decode unit, 32 kiB of private context (“registers”), and 16 kiB of shared context.]
Who cares how many cores?
Idea:
• Program as if there were “infinitely” many cores
• Program as if there were “infinitely” many ALUs per core
Consider: Which is easy to do automatically?
Parallel program → sequential hardware
or
Sequential program → parallel hardware?
[Software representation: a Grid (a kernel is a function on a grid) made up of (Work) Groups, also called “Blocks”, laid out along Axis 0 and Axis 1; each group is made up of (Work) Items, also called “Threads”. The software grid is mapped onto the hardware cores.]
Really: the Group provides a pool of parallelism to draw from.
X, Y, Z order within a group matters. (Not among groups, though.)
Grids can be 1-, 2-, or 3-dimensional.
slide by Andreas Klockner, GPU-Python with PyOpenCL and PyCUDA
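To make the grid/group/item mapping concrete, here is a minimal sketch in plain C (not CUDA or OpenCL API calls; the names n, group_size, num_groups, local, global are illustrative only) of how a 1-D range of work decomposes into groups of items, with the global index recovered from (group, item):

/* Decompose n work-items into ceil(n / group_size) groups. */
#include <stdio.h>

int main(void)
{
    const int n = 20;              /* total work-items needed   */
    const int group_size = 8;      /* items per group ("block") */
    const int num_groups = (n + group_size - 1) / group_size;   /* the grid */

    for (int group = 0; group < num_groups; ++group)       /* over the grid  */
        for (int local = 0; local < group_size; ++local)    /* within a group */
        {
            int global = group * group_size + local;        /* item's index   */
            if (global < n)                                 /* guard the tail */
                printf("group %d, item %d -> element %d\n", group, local, global);
        }
    return 0;
}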
Axis 0
Axis1
HardwareSoftware representation
?
Really: Group providespool of parallelism to drawfrom.
X,Y,Z order within groupmatters. (Not amonggroups, though.)
Grids can be 1,2,3-dimensional.
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
more next time ;-)
Bits of Theory (or “common sense”)
Speedup
• T(1): Performance of best serial algorithm
• p: Number of processors
• T(p): execution time of the parallel algorithm on p processors
• S(p) = T(1) / T(p)
• S(p) ≤ p
Peter Arbenz, Andreas Adelmann, ETH Zurich
Efficiency
• Fraction of time for which a processor does useful work
• E(p) = S(p) / p = T(1) / ( p T(p) )
• E(p) ≤ 1 means S(p) ≤ p
Peter Arbenz, Andreas Adelmann, ETH Zurich
Amdahl’s Law
Peter Arbenz, Andreas Adelmann, ETH Zurich
• α: Fraction of the program that is sequential
• Assumes that the non-sequential portion of the program parallelizes optimally
• T(p) = ( α + (1 − α)/p ) T(1)
Example
• Sequential portion: 10 sec
• Parallel portion: 990 sec
• What is the maximal speedup as p → ∞?

Solution
• Sequential fraction of the code: α = 10 / (10 + 990) = 1/100 = 1%
• Amdahl’s Law: T(p) = ( 0.01 + 0.99/p ) T(1)
• Speedup as p → ∞: S(p) = T(1) / T(p) → 1/α = 100
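As a quick check of these numbers, a small Python sketch (mine, not from the course materials); amdahl_speedup is an illustrative helper name.

# Minimal sketch: evaluate Amdahl's law S(p) = 1 / (alpha + (1 - alpha)/p).
def amdahl_speedup(alpha, p):
    """Speedup on p processors when a fraction alpha of the work is sequential."""
    return 1.0 / (alpha + (1.0 - alpha) / p)

alpha = 10.0 / (10.0 + 990.0)            # sequential fraction = 1%
for p in (1, 10, 100, 1000, 10**6):
    print(p, round(amdahl_speedup(alpha, p), 1))
# As p grows, the speedup approaches 1/alpha = 100.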
Arithmetic Intensity
• W: computational Work in floating-point operations
• M: number of Memory accesses (reads and writes)
• Memory access is the critical issue!
Example 4.1 Memory effects
Memory access is the critical issue in high-performance computing.

Definition 4.2 The work/memory ratio ρWM: the number of floating-point operations
divided by the number of memory locations referenced (either reads or writes).

A look at a book of mathematical tables tells us that

  π/4 = 1 − 1/3 + 1/5 − 1/7 + 1/9 − 1/11 + 1/13 − 1/15 + · · ·    (4.1)

This slowly converging series is a good example for studying the basic operation of
computing the sum of a series of numbers:

  A = Σ_{i=1..N} a_i.    (4.2)

Computation of A in equation (4.2) requires N − 1 floating-point additions and
involves N + 1 memory locations: one for A and N for the a_i’s.
Therefore, the work/memory ratio for this algorithm is ρWM = (N − 1)/(N + 1) ≈ 1
for large N.
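A small Python illustration (not from the book) of the summation in (4.1)–(4.2): summing N terms performs about N − 1 additions over about N + 1 memory locations, so the work/memory ratio is close to 1.

# Illustrative sketch: sum N terms of the series pi/4 = 1 - 1/3 + 1/5 - ...
N = 1_000_000
terms = [(-1.0) ** i / (2 * i + 1) for i in range(N)]   # the a_i's
A = 0.0
for a in terms:                       # roughly N additions over N + 1 locations
    A += a
work_memory_ratio = (N - 1) / (N + 1)
print(4 * A, work_memory_ratio)       # ~3.14159..., ratio ~ 1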
[Figure 9: Hypothetical performance of a parallel implementation of summation: speed-up vs. number of processors.]
Why?
from Scott et al. “Scientific Parallel Computing” (2005)
[Figure 10: Hypothetical performance of a parallel implementation of summation: parallel efficiency vs. number of processors.]
Why?
from Scott et al. “Scientific Parallel Computing” (2005)
Example
Computation done here → Pathway to memory → Main data stored here

Figure 4: A simple memory model with a computational unit with only a small
amount of local memory (not shown), separated from the main memory by a
pathway with limited bandwidth µ.

Theorem 4.1 Suppose that a given algorithm has a work/memory ratio ρWM, and
it is implemented on a system as depicted in Figure 4 with a maximum bandwidth
to memory of µ billion floating-point words per second. Then the maximum
performance that can be achieved is µ ρWM GFLOPS.

Theorem 4.1 provides an upper bound on the number of operations per unit time,
by assuming that a floating-point operation blocks until its data are available to
the CPU. Therefore the CPU cannot proceed faster than the rate at which data
are supplied, and it might proceed slower.

• Bandwidth = 1 Gbyte / sec
• Q: How many float32 ops / sec maximum?
• The processing unit can’t be faster than the rate at which data are supplied, and it might be slower
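A back-of-the-envelope answer, as a sketch assuming 4-byte float32 words and the bound of Theorem 4.1:

# Assumptions: 1 Gbyte/s pathway, 4-byte float32 words, work/memory ratio ~ 1
# (the summation example above).
bandwidth_bytes = 1e9
word_bytes = 4
rho_wm = 1.0
mu = bandwidth_bytes / word_bytes     # 0.25e9 words/s delivered to the unit
max_flops = mu * rho_wm
print(max_flops)                      # ~2.5e8 float32 ops/s, i.e. ~0.25 GFLOPS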
Better?
Computation done here → Local data cache here → Pathway to memory → Main data stored here

Figure 5: A memory model with a large local data cache separated from the main
memory by a pathway with limited bandwidth µ.

The performance of a two-level memory model (as depicted in Figure 5),
consisting of a cache and a main memory, can be modeled simplistically as

  average cycles per word access
    = %hits × (cache cycles per word access)
    + (1 − %hits) × (main memory cycles per word access),    (4.3)

where %hits is the fraction of cache hits among all memory references.

Figure 6 indicates the performance of a hypothetical application, depicting a
decrease in performance as a problem increases in size and migrates into ever
slower memory systems. Eventually the problem size reaches a point where it
cannot ever be completed for lack of memory.
• Yes? In theory... Why?
• No? Why?
from Scott et al. “Scientific Parallel Computing” (2005)
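A tiny Python sketch of equation (4.3), with made-up cycle counts, just to show how strongly the hit rate drives the average access cost:

# Hypothetical numbers: 2 cycles per cached word, 200 cycles per main-memory word.
def avg_cycles_per_word(hit_fraction, cache_cycles=2, memory_cycles=200):
    return hit_fraction * cache_cycles + (1.0 - hit_fraction) * memory_cycles

for hits in (0.50, 0.90, 0.99):
    print(hits, avg_cycles_per_word(hits))
# 0.5 -> 101.0, 0.9 -> 21.8, 0.99 -> 3.98 cycles per word access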
Cache Performance
T        = total execution time
Tcycle   = time for a single processor cycle
Icount   = total number of instructions
IALU     = number of ALU instructions (e.g. register + register)
IMEM     = number of memory access instructions (e.g. load, store)
CPI      = average cycles per instruction
CPIALU   = average cycles per ALU instruction
CPIMEM   = average cycles per memory instruction
rmiss    = cache miss rate
rhit     = cache hit rate
CPIMEM_MISS = cycles per cache miss
CPIMEM_HIT  = cycles per cache hit
MALU     = instruction mix for ALU instructions
MMEM     = instruction mix for memory access instructions
from V. Sarkar (COMP 322, 2009)

Cache Performance: Example
from V. Sarkar (COMP 322, 2009)
Parallel Complexity
Algorithmic Complexity Measures

TP = execution time on P processors

Computation graph abstraction (DAG):
• Node: arbitrary sequential computation
• Edge: dependence (the successor node can only execute after the predecessor node has completed)

Processor abstraction:
• P identical processors (PROC0 … PROCP-1), each executing one node at a time

T1 = work (“work complexity”): the total number of operations performed

T∞ = span* (“step complexity”): the minimum number of steps with unboundedly many processors
* also called critical-path length or computational depth

Lower bounds:
• TP ≥ T1 / P
• TP ≥ T∞

Parallelism (i.e. ideal speed-up): T1 / T∞

Speedup
If T1/TP = Θ(P), we have linear speedup; if T1/TP = P, we have perfect linear speedup; if T1/TP > P, we have superlinear speedup.
Superlinear speedup is not possible in this model because of the lower bound TP ≥ T1/P, but superlinear speedup can be possible in practice (as we will see later in the course).

adapted from V. Sarkar (COMP 322, 2009)
Example — Array Sum: Sequential Version

• Problem: compute the sum of the elements X[0] … X[n-1] of array X
• Sequential algorithm:
    sum = 0; for ( i = 0; i < n; i++ ) sum += X[i];
• Computation graph: a chain of + nodes, 0 + X[0] + X[1] + X[2] + …
• Work = O(n), Span = O(n), Parallelism = O(1)
• How can we design an algorithm (computation graph) with more parallelism?

adapted from V. Sarkar (COMP 322, 2009)
Example — Array Sum: Parallel Iterative Version

• Computation graph for n = 8: pairwise sums X[0]+X[1], X[2]+X[3], X[4]+X[5], X[6]+X[7], combined level by level
• Work = O(n), Span = O(log n), Parallelism = O( n / log n )
• Extra dependence edges due to the forall construct

adapted from V. Sarkar (COMP 322, 2009)
Example — Array Sum: Parallel Recursive Version

• Computation graph for n = 8: recursively sum the two halves, then add the results
• Work = O(n), Span = O(log n), Parallelism = O( n / log n )
• No extra dependences as in the forall case

adapted from V. Sarkar (COMP 322, 2009)
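A plain (sequential) Python sketch of the recursive computation graph above; the two half-sums at each level are independent, which is exactly what a parallel runtime would exploit.

# Sketch of the parallel recursive array sum: work O(n), span O(log n).
def array_sum(X, lo, hi):
    if hi - lo == 1:
        return X[lo]
    mid = (lo + hi) // 2
    left = array_sum(X, lo, mid)      # these two calls are independent and
    right = array_sum(X, mid, hi)     # could run in parallel
    return left + right

X = list(range(8))
print(array_sum(X, 0, len(X)))        # 28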
Patterns
Task vs Data Parallelism
Task parallelism
• Distribute the tasks across processors based on dependency
• Coarse-grain parallelism
[Figure: a task dependency graph (Tasks 1–9) and its assignment across 3 processors (P1, P2, P3) over time]

Data parallelism
• Run a single kernel over many elements
  – Each element is independently updated
  – Same operation is applied on each element
• Fine-grain parallelism
  – Many lightweight threads, easy to switch context
  – Maps well to an ALU-heavy architecture: the GPU
[Figure: one kernel applied to data elements across processors P1 … Pn]
Task vs. Data parallelism
• Task parallel
  – Independent processes with little communication
  – Easy to use
  – “Free” on modern operating systems with SMP
• Data parallel
  – Lots of data on which the same computation is being executed
  – No dependencies between data elements in each step of the computation
  – Can saturate many ALUs
  – But often requires redesign of traditional algorithms
slide by Mike Houston
CPU vs. GPU
• CPU
– Really fast caches (great for data reuse)
– Fine branching granularity
– Lots of different processes/threads
– High performance on a single thread of execution
• GPU
– Lots of math units
– Fast access to onboard memory
– Run a program on each fragment/vertex
– High throughput on parallel tasks
• CPUs are great for task parallelism
• GPUs are great for data parallelism
slide by Mike Houston
GPU-friendly Problems
• Data-parallel processing
• High arithmetic intensity
  – Keep the GPU busy all the time
  – Computation offsets memory latency
• Coherent data access
  – Access large chunks of contiguous memory
  – Exploit fast on-chip shared memory
The Algorithm Matters
• Jacobi: Parallelizable (reads only the previous iterate v_old, writes v_new)

    for (int i = 1; i < num - 1; i++) {
        v_new[i] = (v_old[i-1] + v_old[i+1]) / 2.0;
    }

• Gauss-Seidel: Difficult to parallelize (reads values already updated in the same sweep)

    for (int i = 1; i < num - 1; i++) {
        v[i] = (v[i-1] + v[i+1]) / 2.0;
    }
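A NumPy sketch (illustrative names, not from the slides) of why the two loops differ: the Jacobi update reads only the old iterate and can be applied to all elements at once, while Gauss-Seidel reads values written earlier in the same sweep.

import numpy as np

v_old = np.linspace(0.0, 1.0, 10)

# Jacobi: every element depends only on the previous iterate -> data parallel.
v_new = v_old.copy()
v_new[1:-1] = 0.5 * (v_old[:-2] + v_old[2:])

# Gauss-Seidel: each update reads v[i-1] already updated in this sweep -> sequential.
v = v_old.copy()
for i in range(1, len(v) - 1):
    v[i] = 0.5 * (v[i - 1] + v[i + 1])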
Example: Reduction
• Serial version (O(N))

    for (int i = 1; i < N; i++) {
        v[0] += v[i];
    }

• Parallel version (O(log N))

    width = N / 2;
    while (width >= 1) {
        for (int i = 0; i < width; i++) {
            v[i] += v[i + width];   // computed in parallel
        }
        width /= 2;
    }
The Importance of Data Parallelism for GPUs
• GPUs are designed for highly parallel tasks like rendering
• GPUs process independent vertices and fragments
  – Temporary registers are zeroed
  – No shared or static data
  – No read-modify-write buffers
  – In short, no communication between vertices or fragments
• Data-parallel processing
  – GPU architectures are ALU-heavy
    • Multiple vertex & pixel pipelines
    • Lots of compute power
  – GPU memory systems are designed to stream data
    • Linear access patterns can be prefetched
    • Hide memory latency
slide by Mike Houston
GPUs
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F
!"!# !"$#
$"!# $"$#
!%&'() $*(+%,()
!%&'()
$*(+%,()
#-+-
"&.+/*0+%1& !"!# !"$#
$"!# $"$#
!%&'() $*(+%,()
!%&'()
$*(+%,()
#-+-
"&.+/*0+%1&
!"!# !"$#
$"!# $"$#
!%&'() $*(+%,()
!%&'()
$*(+%,()
#-+-
"&.+/*0+%1& !"!# !"$#
$"!# $"$#
!%&'() $*(+%,()
!%&'()
$*(+%,()
#-+-
"&.+/*0+%1&
!"!# !"$#
$"!# $"$#
!%&'() $*(+%,()
!%&'()
$*(+%,()
"&.+/*0+%1&
"&.+/*0+%1&
$"$#
!"#$%&'(%)*$+
!(, -.(/
0123$1453%&'(%)*$+
(,, 67523%$2
8+4$1& 9$1&
slide by Matthew Bolitho
Flynn’s Taxonomy
Early classification of parallel computing architectures given by M. Flynn (1972), using the number of instruction streams and data streams. Still used.
• Single Instruction Single Data (SISD): conventional sequential computer with one processor, single program and data storage.
• Multiple Instruction Single Data (MISD): used for fault tolerance (Space Shuttle) - from Wikipedia
• Single Instruction Multiple Data (SIMD): each processing element uses the same instruction, applied synchronously in parallel to different data elements (Connection Machine, GPUs). If-then-else statements take two steps to execute.
• Multiple Instruction Multiple Data (MIMD): each processing element loads separate instructions and separate data elements; processors work asynchronously. Since 2006 the top ten supercomputers have been of this type (w/o the 10K-node SGI Altix Columbia at NASA Ames).
Update: Single Program Multiple Data (SPMD): autonomous processors executing the same program but not in lockstep. Most common style of programming.
adapted from Berger & Klöckner (NYU 2010)
Finding Concurrency
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
&
! !"#$%&'()'*$'+&',)($'',%$)-."#)$/.&0/1)
0#$."2'#'3&
! 45)$."$".&0"3)"-)$."6./#)&7/&)0()$/./11'1
! 85)($'',%$)"-)$/./11'1)$".&0"3)
! 9)('.0/1)/16".0&7#)+/3):')#/,')$/./11'1):;)
!"#$"#%&'!"#$%&'()$!*+%,+-..!,+/0
! <03,)-%3,/#'3&/1)$/.&()"-)&7')/16".0&7#)
&7/&)/.')('$/./:1'
slide by Matthew Bolitho
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
! 896)0,-5*#%(".%:'%3'()*+)#'3%:7%:)-5%-"#$%
".3%3"-";
! !"#$;%<,.3%60)1+#%)=%,.#-01(-,).#%-5"-%(".%:'%
'>'(1-'3%,.%+"0"99'9
! %"&";%<,.3%+"0-,-,).#%,.%-5'%3"-"%-5"-%(".%:'%1#'3%
?0'9"-,@'97A%,.3'+'.3'.-97
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
! 8."97B'%-5'%"96)0,-5*%".3%=,.3%!"#$%&'#('
)*&+"$,+)#*& -5"-%"0'%?0'9"-,@'97A%,.3'+'.3'.-
! 8."97B'%-5'%"96)0,-5*%".3%=,.3%!"#$%&'#('
)*&+"$,+)#*& -5"-%"0'%?0'9"-,@'97A%,.3'+'.3'.-
! C6;%D"-0,>%D19-,+9,("-,).
! E)*+1-,.6%'"(5%'9'*'.-%)=%F%,#%"%3)-%+0)31(-
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
! 896)0,-5*#%(".%:'%3'()*+)#'3%:7%:)-5%-"#$%
".3%3"-";
! !"#$;%<,.3%60)1+#%)=%,.#-01(-,).#%-5"-%(".%:'%
'>'(1-'3%,.%+"0"99'9
! %"&";%<,.3%+"0-,-,).#%,.%-5'%3"-"%-5"-%(".%:'%1#'3%
?0'9"-,@'97A%,.3'+'.3'.-97
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
! 8."97B'%-5'%"96)0,-5*%".3%=,.3%!"#$%&'#('
)*&+"$,+)#*& -5"-%"0'%?0'9"-,@'97A%,.3'+'.3'.-
! 8."97B'%-5'%"96)0,-5*%".3%=,.3%!"#$%&'#('
)*&+"$,+)#*& -5"-%"0'%?0'9"-,@'97A%,.3'+'.3'.-
! C6;%D"-0,>%D19-,+9,("-,).
! E)*+1-,.6%'"(5%'9'*'.-%)=%F%,#%"%3)-%+0)31(-
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
! 896)0,-5*#%(".%:'%3'()*+)#'3%:7%:)-5%-"#$%
".3%3"-";
! !"#$;%<,.3%60)1+#%)=%,.#-01(-,).#%-5"-%(".%:'%
'>'(1-'3%,.%+"0"99'9
! %"&";%<,.3%+"0-,-,).#%,.%-5'%3"-"%-5"-%(".%:'%1#'3%
?0'9"-,@'97A%,.3'+'.3'.-97
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
! 8."97B'%-5'%"96)0,-5*%".3%=,.3%!"#$%&'#('
)*&+"$,+)#*& -5"-%"0'%?0'9"-,@'97A%,.3'+'.3'.-
! 8."97B'%-5'%"96)0,-5*%".3%=,.3%!"#$%&'#('
)*&+"$,+)#*& -5"-%"0'%?0'9"-,@'97A%,.3'+'.3'.-
! C6;%D"-0,>%D19-,+9,("-,).
! E)*+1-,.6%'"(5%'9'*'.-%)=%F%,#%"%3)-%+0)31(-
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
'
! !"#$%&'()*'(#$+,-.)*/(#"0(1."0(!"#$%&'#('
)*&+"$,+)#*& )*#)(#-'(2-'$#).3'$%4(."0'5'"0'")
! 6+7(8,$'9:$#-(;%"#/.9<! =,/5:)'>.?-#).,"#$@,-9'<
! =,/5:)'A,)#).,"#$@,-9'<
! =,/5:)';.*'0-#$@,-9'<
! =,/5:)'B'.+*?,:-<
! =,/5:)'B,"C,"0."+@,-9'<
! D50#)'E,<.).,"<!"0>'$,9.).'<
F#<G(;'9,/5,<.).,"
;#)#(;'9,/5,<.).,"
H-,:5(F#<G<
I-0'-(F#<G<
;#)#(J*#-."+
;'9,/5,<.).," ;'5'"0'"9%(!"#$%<.<
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(=,$:/"<(#"0(A,K<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(C$,9G<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
! @."0(K#%<(),(5#-).).,"()*'(0#)#
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
'
! !"#$%&'()*'(#$+,-.)*/(#"0(1."0(!"#$%&'#('
)*&+"$,+)#*& )*#)(#-'(2-'$#).3'$%4(."0'5'"0'")
! 6+7(8,$'9:$#-(;%"#/.9<! =,/5:)'>.?-#).,"#$@,-9'<
! =,/5:)'A,)#).,"#$@,-9'<
! =,/5:)';.*'0-#$@,-9'<
! =,/5:)'B'.+*?,:-<
! =,/5:)'B,"C,"0."+@,-9'<
! D50#)'E,<.).,"<!"0>'$,9.).'<
F#<G(;'9,/5,<.).,"
;#)#(;'9,/5,<.).,"
H-,:5(F#<G<
I-0'-(F#<G<
;#)#(J*#-."+
;'9,/5,<.).," ;'5'"0'"9%(!"#$%<.<
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(=,$:/"<(#"0(A,K<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(C$,9G<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
! @."0(K#%<(),(5#-).).,"()*'(0#)#
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
'
! !"#$%&'()*'(#$+,-.)*/(#"0(1."0(!"#$%&'#('
)*&+"$,+)#*& )*#)(#-'(2-'$#).3'$%4(."0'5'"0'")
! 6+7(8,$'9:$#-(;%"#/.9<! =,/5:)'>.?-#).,"#$@,-9'<
! =,/5:)'A,)#).,"#$@,-9'<
! =,/5:)';.*'0-#$@,-9'<
! =,/5:)'B'.+*?,:-<
! =,/5:)'B,"C,"0."+@,-9'<
! D50#)'E,<.).,"<!"0>'$,9.).'<
F#<G(;'9,/5,<.).,"
;#)#(;'9,/5,<.).,"
H-,:5(F#<G<
I-0'-(F#<G<
;#)#(J*#-."+
;'9,/5,<.).," ;'5'"0'"9%(!"#$%<.<
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(=,$:/"<(#"0(A,K<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(C$,9G<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
! @."0(K#%<(),(5#-).).,"()*'(0#)#
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
'
! !"#$%&'()*'(#$+,-.)*/(#"0(1."0(!"#$%&'#('
)*&+"$,+)#*& )*#)(#-'(2-'$#).3'$%4(."0'5'"0'")
! 6+7(8,$'9:$#-(;%"#/.9<! =,/5:)'>.?-#).,"#$@,-9'<
! =,/5:)'A,)#).,"#$@,-9'<
! =,/5:)';.*'0-#$@,-9'<
! =,/5:)'B'.+*?,:-<
! =,/5:)'B,"C,"0."+@,-9'<
! D50#)'E,<.).,"<!"0>'$,9.).'<
F#<G(;'9,/5,<.).,"
;#)#(;'9,/5,<.).,"
H-,:5(F#<G<
I-0'-(F#<G<
;#)#(J*#-."+
;'9,/5,<.).," ;'5'"0'"9%(!"#$%<.<
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(=,$:/"<(#"0(A,K<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(C$,9G<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
! @."0(K#%<(),(5#-).).,"()*'(0#)#
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
0
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !5'0'%"0'%*".7%:"7#%-)%3'()*+)#'%".7%
6,;'.%"96)0,-5*
! 4)*'-,*'#%3"-"%3'()*+)#'%'"#,97
! 4)*'-,*'#%-"#$#%3'()*+)#'%'"#,97
! 4)*'-,*'#%<)-5=! 4)*'-,*'#%.',-5'0=
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! 2.('%-5'%"96)0,-5*%5"#%<''.%3'()*+)#'3%
,.-)%3"-"%".3%-"#$#>
!8."97?' @.-'0"(-,).#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !)%'"#'%-5'%*"."6'*'.-%)A%3'+'.3'.(,'#%
A,.3%-"#$#%-5"-%"0'%#,*,9"0%".3%60)1+%-5'*
! !5'.%"."97?'%().#-0",.-#%-)%3'-'0*,.'%".7%
.'('##"07%)03'0
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
0
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !5'0'%"0'%*".7%:"7#%-)%3'()*+)#'%".7%
6,;'.%"96)0,-5*
! 4)*'-,*'#%3"-"%3'()*+)#'%'"#,97
! 4)*'-,*'#%-"#$#%3'()*+)#'%'"#,97
! 4)*'-,*'#%<)-5=! 4)*'-,*'#%.',-5'0=
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! 2.('%-5'%"96)0,-5*%5"#%<''.%3'()*+)#'3%
,.-)%3"-"%".3%-"#$#>
!8."97?' @.-'0"(-,).#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !)%'"#'%-5'%*"."6'*'.-%)A%3'+'.3'.(,'#%
A,.3%-"#$#%-5"-%"0'%#,*,9"0%".3%60)1+%-5'*
! !5'.%"."97?'%().#-0",.-#%-)%3'-'0*,.'%".7%
.'('##"07%)03'0
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
0
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !5'0'%"0'%*".7%:"7#%-)%3'()*+)#'%".7%
6,;'.%"96)0,-5*
! 4)*'-,*'#%3"-"%3'()*+)#'%'"#,97
! 4)*'-,*'#%-"#$#%3'()*+)#'%'"#,97
! 4)*'-,*'#%<)-5=! 4)*'-,*'#%.',-5'0=
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! 2.('%-5'%"96)0,-5*%5"#%<''.%3'()*+)#'3%
,.-)%3"-"%".3%-"#$#>
!8."97?' @.-'0"(-,).#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !)%'"#'%-5'%*"."6'*'.-%)A%3'+'.3'.(,'#%
A,.3%-"#$#%-5"-%"0'%#,*,9"0%".3%60)1+%-5'*
! !5'.%"."97?'%().#-0",.-#%-)%3'-'0*,.'%".7%
.'('##"07%)03'0
slide by Matthew Bolitho
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
0
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !5'0'%"0'%*".7%:"7#%-)%3'()*+)#'%".7%
6,;'.%"96)0,-5*
! 4)*'-,*'#%3"-"%3'()*+)#'%'"#,97
! 4)*'-,*'#%-"#$#%3'()*+)#'%'"#,97
! 4)*'-,*'#%<)-5=! 4)*'-,*'#%.',-5'0=
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! 2.('%-5'%"96)0,-5*%5"#%<''.%3'()*+)#'3%
,.-)%3"-"%".3%-"#$#>
!8."97?' @.-'0"(-,).#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !)%'"#'%-5'%*"."6'*'.-%)A%3'+'.3'.(,'#%
A,.3%-"#$#%-5"-%"0'%#,*,9"0%".3%60)1+%-5'*
! !5'.%"."97?'%().#-0",.-#%-)%3'-'0*,.'%".7%
.'('##"07%)03'0
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! :").4'$#@"*-$-#="2/$&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$#?$0+(<"42&! :").4'$#?"*D@"*-0*+#="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
A.-%'$#B"&0'0"*&#%*-#;$3"/0'0$&
?"*#@"*-$-#="2/$&
?$0+(<"2#H0&'
@"*-$-#="2/$&
!%&1#8$/")."&0'0"*
8%'%#8$/")."&0'0"*
I2"4.#!%&1&
E2-$2#!%&1&
8%'%#J(%20*+
8$/")."&0'0"* 8$.$*-$*/9#C*%39&0&
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! :").4'$#@"*-$-#="2/$&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$#?$0+(<"42&! :").4'$#?"*D@"*-0*+#="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
A.-%'$#B"&0'0"*&#%*-#;$3"/0'0$&
?"*#@"*-$-#="2/$&
?$0+(<"2#H0&'
@"*-$-#="2/$&
!%&1#8$/")."&0'0"*
8%'%#8$/")."&0'0"*
I2"4.#!%&1&
E2-$2#!%&1&
8%'%#J(%20*+
8$/")."&0'0"* 8$.$*-$*/9#C*%39&0&
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! :").4'$#@"*-$-#="2/$&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$#?$0+(<"42&! :").4'$#?"*D@"*-0*+#="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
A.-%'$#B"&0'0"*&#%*-#;$3"/0'0$&
?"*#@"*-$-#="2/$&
?$0+(<"2#H0&'
@"*-$-#="2/$&
!%&1#8$/")."&0'0"*
8%'%#8$/")."&0'0"*
I2"4.#!%&1&
E2-$2#!%&1&
8%'%#J(%20*+
8$/")."&0'0"* 8$.$*-$*/9#C*%39&0&
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! :").4'$#@"*-$-#="2/$&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$#?$0+(<"42&! :").4'$#?"*D@"*-0*+#="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
A.-%'$#B"&0'0"*&#%*-#;$3"/0'0$&
?"*#@"*-$-#="2/$&
?$0+(<"2#H0&'
@"*-$-#="2/$&
!%&1#8$/")."&0'0"*
8%'%#8$/")."&0'0"*
I2"4.#!%&1&
E2-$2#!%&1&
8%'%#J(%20*+
8$/")."&0'0"* 8$.$*-$*/9#C*%39&0&
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F$
! !"#$%&'()*'++,%-(.$($.%/(-0&1%-2%)'131%'".%
&'()*)*-"1%-2%.')'%'($%*.$")*2*$.4%'"'+,5$%)6$%
!"#"$%&"'()*$)6')%-##0(1
! 7')'%16'(*"/%#'"%8$%#')$/-(*5$.%'19
! :$'.;-"+,
! <22$#)*=$+,%>-#'+
! :$'.;?(*)$
! @##0A0+')$
! B0+)*&+$%:$'.CD*"/+$%?(*)$
+,"!-.)/0
! 7')'%*1%($'.4%80)%"-)%E(*))$"
! F-%#-"1*1)$"#,%&(-8+$A1
! :$&+*#')*-"%*"%.*1)(*80)$.%1,1)$A
122,3#(4,/0-5.3"/
! 7')'%*1%($'.%'".%E(*))$"
! 7')'%*1%&'()*)*-"$.%*")-%1081$)1
! !"$%)'13%&$(%1081$)
! G'"%.*1)(*80)$%1081$)1
+,"!-6'(#,
! 7')'%*1%($'.%'".%E(*))$"
! B'",%)'131%'##$11%A'",%.')'
! G-"1*1)$"#,%*110$1
! B-1)%.*22*#0+)%)-%.$'+%E*)6
+,"!-6'(#,$!733898/"#(.)%
! @1%&$(%:$'.;?(*)$4%'+)6-0/6%E(*)$1%#-"1*1)%-2%'"%
'##0A0+')*-"%-&$(')*-"
! G-AA-"%*"%($.0#)*-";),&$%'+/-(*)6A1
! G'"%($&+*#')$%1*"#$%'##0A0+')*-"%#'"%8$%+*"$'(
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F$
! !"#$%&'()*'++,%-(.$($.%/(-0&1%-2%)'131%'".%
&'()*)*-"1%-2%.')'%'($%*.$")*2*$.4%'"'+,5$%)6$%
!"#"$%&"'()*$)6')%-##0(1
! 7')'%16'(*"/%#'"%8$%#')$/-(*5$.%'19
! :$'.;-"+,
! <22$#)*=$+,%>-#'+
! :$'.;?(*)$
! @##0A0+')$
! B0+)*&+$%:$'.CD*"/+$%?(*)$
+,"!-.)/0
! 7')'%*1%($'.4%80)%"-)%E(*))$"
! F-%#-"1*1)$"#,%&(-8+$A1
! :$&+*#')*-"%*"%.*1)(*80)$.%1,1)$A
122,3#(4,/0-5.3"/
! 7')'%*1%($'.%'".%E(*))$"
! 7')'%*1%&'()*)*-"$.%*")-%1081$)1
! !"$%)'13%&$(%1081$)
! G'"%.*1)(*80)$%1081$)1
+,"!-6'(#,
! 7')'%*1%($'.%'".%E(*))$"
! B'",%)'131%'##$11%A'",%.')'
! G-"1*1)$"#,%*110$1
! B-1)%.*22*#0+)%)-%.$'+%E*)6
+,"!-6'(#,$!733898/"#(.)%
! @1%&$(%:$'.;?(*)$4%'+)6-0/6%E(*)$1%#-"1*1)%-2%'"%
'##0A0+')*-"%-&$(')*-"
! G-AA-"%*"%($.0#)*-";),&$%'+/-(*)6A1
! G'"%($&+*#')$%1*"#$%'##0A0+')*-"%#'"%8$%+*"$'(
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F$
! !"#$%&'()*'++,%-(.$($.%/(-0&1%-2%)'131%'".%
&'()*)*-"1%-2%.')'%'($%*.$")*2*$.4%'"'+,5$%)6$%
!"#"$%&"'()*$)6')%-##0(1
! 7')'%16'(*"/%#'"%8$%#')$/-(*5$.%'19
! :$'.;-"+,
! <22$#)*=$+,%>-#'+
! :$'.;?(*)$
! @##0A0+')$
! B0+)*&+$%:$'.CD*"/+$%?(*)$
+,"!-.)/0
! 7')'%*1%($'.4%80)%"-)%E(*))$"
! F-%#-"1*1)$"#,%&(-8+$A1
! :$&+*#')*-"%*"%.*1)(*80)$.%1,1)$A
122,3#(4,/0-5.3"/
! 7')'%*1%($'.%'".%E(*))$"
! 7')'%*1%&'()*)*-"$.%*")-%1081$)1
! !"$%)'13%&$(%1081$)
! G'"%.*1)(*80)$%1081$)1
+,"!-6'(#,
! 7')'%*1%($'.%'".%E(*))$"
! B'",%)'131%'##$11%A'",%.')'
! G-"1*1)$"#,%*110$1
! B-1)%.*22*#0+)%)-%.$'+%E*)6
+,"!-6'(#,$!733898/"#(.)%
! @1%&$(%:$'.;?(*)$4%'+)6-0/6%E(*)$1%#-"1*1)%-2%'"%
'##0A0+')*-"%-&$(')*-"
! G-AA-"%*"%($.0#)*-";),&$%'+/-(*)6A1
! G'"%($&+*#')$%1*"#$%'##0A0+')*-"%#'"%8$%+*"$'(
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F$
! !"#$%&'()*'++,%-(.$($.%/(-0&1%-2%)'131%'".%
&'()*)*-"1%-2%.')'%'($%*.$")*2*$.4%'"'+,5$%)6$%
!"#"$%&"'()*$)6')%-##0(1
! 7')'%16'(*"/%#'"%8$%#')$/-(*5$.%'19
! :$'.;-"+,
! <22$#)*=$+,%>-#'+
! :$'.;?(*)$
! @##0A0+')$
! B0+)*&+$%:$'.CD*"/+$%?(*)$
+,"!-.)/0
! 7')'%*1%($'.4%80)%"-)%E(*))$"
! F-%#-"1*1)$"#,%&(-8+$A1
! :$&+*#')*-"%*"%.*1)(*80)$.%1,1)$A
122,3#(4,/0-5.3"/
! 7')'%*1%($'.%'".%E(*))$"
! 7')'%*1%&'()*)*-"$.%*")-%1081$)1
! !"$%)'13%&$(%1081$)
! G'"%.*1)(*80)$%1081$)1
+,"!-6'(#,
! 7')'%*1%($'.%'".%E(*))$"
! B'",%)'131%'##$11%A'",%.')'
! G-"1*1)$"#,%*110$1
! B-1)%.*22*#0+)%)-%.$'+%E*)6
+,"!-6'(#,$!733898/"#(.)%
! @1%&$(%:$'.;?(*)$4%'+)6-0/6%E(*)$1%#-"1*1)%-2%'"%
'##0A0+')*-"%-&$(')*-"
! G-AA-"%*"%($.0#)*-";),&$%'+/-(*)6A1
! G'"%($&+*#')$%1*"#$%'##0A0+')*-"%#'"%8$%+*"$'(
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F$
! !"#$%&'()*'++,%-(.$($.%/(-0&1%-2%)'131%'".%
&'()*)*-"1%-2%.')'%'($%*.$")*2*$.4%'"'+,5$%)6$%
!"#"$%&"'()*$)6')%-##0(1
! 7')'%16'(*"/%#'"%8$%#')$/-(*5$.%'19
! :$'.;-"+,
! <22$#)*=$+,%>-#'+
! :$'.;?(*)$
! @##0A0+')$
! B0+)*&+$%:$'.CD*"/+$%?(*)$
+,"!-.)/0
! 7')'%*1%($'.4%80)%"-)%E(*))$"
! F-%#-"1*1)$"#,%&(-8+$A1
! :$&+*#')*-"%*"%.*1)(*80)$.%1,1)$A
122,3#(4,/0-5.3"/
! 7')'%*1%($'.%'".%E(*))$"
! 7')'%*1%&'()*)*-"$.%*")-%1081$)1
! !"$%)'13%&$(%1081$)
! G'"%.*1)(*80)$%1081$)1
+,"!-6'(#,
! 7')'%*1%($'.%'".%E(*))$"
! B'",%)'131%'##$11%A'",%.')'
! G-"1*1)$"#,%*110$1
! B-1)%.*22*#0+)%)-%.$'+%E*)6
+,"!-6'(#,$!733898/"#(.)%
! @1%&$(%:$'.;?(*)$4%'+)6-0/6%E(*)$1%#-"1*1)%-2%'"%
'##0A0+')*-"%-&$(')*-"
! G-AA-"%*"%($.0#)*-";),&$%'+/-(*)6A1
! G'"%($&+*#')$%1*"#$%'##0A0+')*-"%#'"%8$%+*"$'(
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
FF
!"#$%&'()"*!+,-)(.-"*!"#$/0(12-"*&'()"
! !"#$%&#'%()*+&,-%.#(/-01230#14/5#14%#-("6#7&,-%"
! '%/(8%)#914","-%495#914"-&(,4-"
! :8(;$/%<#=(-&,8#=2/-,$/,9(-,14
3 4
! :8(;$/%<#=(-&,8#=2/-,$/,9(-,14
3 4
'%()*>4/5
'%()*>4/5
:??%9-,@%/5*
A19(/
! :8(;$/%<#=1/%92/(&#B54(;,9"
C$)(-%#D1",-,14"#(4)#E%/19,-,%"
F14#G14)%)#H1&9%"
F%,30I1&#A,"-
G14)%)#H1&9%"
H1&9%"
! :8(;$/%<#=1/%92/(&#B54(;,9"
C$)(-%#D1",-,14"#(4)#E%/19,-,%"
F14#G14)%)#H1&9%"
F%,30I1&#A,"-
G14)%)#H1&9%"
!-1;,9#
J11&),4(-%"
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
FF
!"#$%&'()"*!+,-)(.-"*!"#$/0(12-"*&'()"
! !"#$%&#'%()*+&,-%.#(/-01230#14/5#14%#-("6#7&,-%"
! '%/(8%)#914","-%495#914"-&(,4-"
! :8(;$/%<#=(-&,8#=2/-,$/,9(-,14
3 4
! :8(;$/%<#=(-&,8#=2/-,$/,9(-,14
3 4
'%()*>4/5
'%()*>4/5
:??%9-,@%/5*
A19(/
! :8(;$/%<#=1/%92/(&#B54(;,9"
C$)(-%#D1",-,14"#(4)#E%/19,-,%"
F14#G14)%)#H1&9%"
F%,30I1&#A,"-
G14)%)#H1&9%"
H1&9%"
! :8(;$/%<#=1/%92/(&#B54(;,9"
C$)(-%#D1",-,14"#(4)#E%/19,-,%"
F14#G14)%)#H1&9%"
F%,30I1&#A,"-
G14)%)#H1&9%"
!-1;,9#
J11&),4(-%"
slide by Matthew Bolitho
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
FF
!"#$%&'()"*!+,-)(.-"*!"#$/0(12-"*&'()"
! !"#$%&#'%()*+&,-%.#(/-01230#14/5#14%#-("6#7&,-%"
! '%/(8%)#914","-%495#914"-&(,4-"
! :8(;$/%<#=(-&,8#=2/-,$/,9(-,14
3 4
! :8(;$/%<#=(-&,8#=2/-,$/,9(-,14
3 4
'%()*>4/5
'%()*>4/5
:??%9-,@%/5*
A19(/
! :8(;$/%<#=1/%92/(&#B54(;,9"
C$)(-%#D1",-,14"#(4)#E%/19,-,%"
F14#G14)%)#H1&9%"
F%,30I1&#A,"-
G14)%)#H1&9%"
H1&9%"
! :8(;$/%<#=1/%92/(&#B54(;,9"
C$)(-%#D1",-,14"#(4)#E%/19,-,%"
F14#G14)%)#H1&9%"
F%,30I1&#A,"-
G14)%)#H1&9%"
!-1;,9#
J11&),4(-%"
slide by Matthew Bolitho
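As a concrete instance of the decompositions above, a NumPy sketch (illustrative, not from the slides) of block-wise matrix multiplication: each block of C depends only on a block-row of A and a block-column of B, so the blocks are independent tasks.

import numpy as np

n, b = 8, 4                      # matrix size and block size (illustrative)
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = np.zeros((n, n))

# Each (i, j) block of C can be computed independently (one task per block).
for i in range(0, n, b):
    for j in range(0, n, b):
        C[i:i+b, j:j+b] = A[i:i+b, :] @ B[:, j:j+b]

assert np.allclose(C, A @ B)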
Useful patterns (for reference)
Embarrassingly Parallel

yi = fi(xi)   where i ∈ {1, . . . , N}.

Notation (also for the rest of this lecture):
• xi : inputs
• yi : outputs
• fi : (pure) functions (i.e. no side effects)

When does a function have a “side effect”?
In addition to producing a value, it
• modifies non-local state, or
• has an observable interaction with the outside world.

Often f1 = · · · = fN. Then:
• Lisp/Python function map
• C++ STL std::transform

slide from Berger & Klöckner (NYU 2010)
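A minimal sketch of the pattern with Python’s map; f is an arbitrary pure function, and a multiprocessing pool can evaluate the independent calls in parallel (names are illustrative).

from multiprocessing import Pool

def f(x):                      # pure function: no side effects
    return x * x

xs = list(range(16))

ys_serial = list(map(f, xs))   # the pattern itself: y_i = f(x_i)

if __name__ == "__main__":     # the same map, spread over worker processes
    with Pool(4) as pool:
        ys_parallel = pool.map(f, xs)
        assert ys_parallel == ys_serial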
Embarrassingly Parallel: Graph Representation

[Graph: independent nodes f0 … f8, each mapping its input xi to its output yi, with no edges between them]

Trivial? Often: no.

slide from Berger & Klöckner (NYU 2010)
Embarrassingly Parallel: Examples

Surprisingly useful:
• Element-wise linear algebra: addition, scalar multiplication (not inner product)
• Image Processing: shift, rotate, clip, scale, . . .
• Monte Carlo simulation
• (Brute-force) Optimization
• Random Number Generation
• Encryption, Compression (after blocking)
• Software compilation (make -j8)

But: Still needs a minimum of coordination. How can that be achieved?

slide from Berger & Klöckner (NYU 2010)
Mother-Child Parallelism

[Diagram: a Mother process sends initial data to children 0–4 and collects their results]

(formerly called “Master-Slave”)

slide from Berger & Klöckner (NYU 2010)
Embarrassingly Parallel: Issues

• Process Creation: Dynamic/Static?
  • MPI 2 supports dynamic process creation
• Job Assignment (“Scheduling”): Dynamic/Static?
• Operations/data light- or heavy-weight?
• Variable-size data?
• Load Balancing:
  • Here: easy
  Can you think of a load balancing recipe?

slide from Berger & Klöckner (NYU 2010)
Partition
yi = fi(xi−1, xi, xi+1)   where i ∈ {1, . . . , N}.

Includes straightforward generalizations to dependencies on a larger (but not O(P)-sized!) set of neighbor inputs.

slide from Berger & Klöckner (NYU 2010)
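A NumPy sketch of the pattern for a three-point stencil (the averaging function is just an illustration): interior updates are independent, and only values at partition boundaries would need to be exchanged between processors.

import numpy as np

def stencil_step(x):
    """y_i = f(x_{i-1}, x_i, x_{i+1}) on the interior; here f is an average."""
    y = x.copy()
    y[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
    return y

x = np.random.rand(16)
y = stencil_step(x)   # with P partitions, only boundary elements must be exchanged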
Partition: Graph
[Graph: outputs y1 … y5, each depending on its own input xi and its neighbors xi−1, xi+1]

slide from Berger & Klöckner (NYU 2010)
Partition: Examples
• Time-marching (in particular: PDE solvers)
  • (Including finite differences → HW3!)
• Iterative Methods
  • Solve Ax = b (Jacobi, . . . )
  • Optimization (all P on a single problem)
  • Eigenvalue solvers
• Cellular Automata (Game of Life :-)

slide from Berger & Klöckner (NYU 2010)
Partition: Issues
• Only useful when the computation is mainly local
• Responsibility for updating one datum rests with one processor
• Synchronization, Deadlock, Livelock, . . .
• Performance Impact
• Granularity
• Load Balancing: Thorny issue → next lecture
• Regularity of the Partition?

slide from Berger & Klöckner (NYU 2010)
Pipelined Computation
y = fN(· · · f2(f1(x)) · · · ) = (fN ◦ · · · ◦ f1)(x)   where N is fixed.

slide from Berger & Klöckner (NYU 2010)
Pipelined Computation: Graph
[Graph: x → f1 → f2 → f3 → · · · → fN → y]

Processor Assignment?

slide from Berger & Klöckner (NYU 2010)
Pipelined Computation: Examples
• Image processing
• Any multi-stage algorithm
• Pre/post-processing or I/O
• Out-of-Core algorithms

Specific simple examples:
• Sorting (insertion sort)
• Triangular linear system solve (“backsubstitution”)
• Key: Pass on values as soon as they’re available
(will see more efficient algorithms for both later)

slide from Berger & Klöckner (NYU 2010)
Pipelined Computation: Issues
• Non-optimal while the pipeline fills or empties
• Often communication-inefficient for large data
• Needs some attention to synchronization, deadlock avoidance
• Can accommodate some asynchrony
But don’t want:
• Pile-up
• Starvation

slide from Berger & Klöckner (NYU 2010)
Reduction
y = f(· · · f(f(x1, x2), x3), . . . , xN)   where N is the input size.

Also known as . . .
• Lisp/Python function reduce (Scheme: fold)
• C++ STL std::accumulate

slide from Berger & Klöckner (NYU 2010)
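The same pattern in Python, as a sketch: functools.reduce applies f left to right, exactly the chained form above.

from functools import reduce
import operator

xs = [3, 1, 7, 0, 4, 1, 6, 3]
total = reduce(operator.add, xs)        # f(...f(f(x1, x2), x3)..., xN)
print(total)                            # 25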
Reduction: Graph
[Graph: a left-leaning chain folding x1 … x6 into y one element at a time]

Painful! Not parallelizable.

slide from Berger & Klöckner (NYU 2010)
Approach to Reduction
f(x, y)?

Can we do better?
“Tree” very imbalanced. What property of f would allow ‘rebalancing’?

f(f(x, y), z) = f(x, f(y, z))

Looks less improbable if we let x ◦ y = f(x, y):

x ◦ (y ◦ z) = (x ◦ y) ◦ z

Has a very familiar name: Associativity

slide from Berger & Klöckner (NYU 2010)
Reduction: A Better Graph
[Graph: a balanced binary tree combining x0 … x7 pairwise down to y]

Processor allocation?

slide from Berger & Klöckner (NYU 2010)
Mapping Reduction to the GPU

• Obvious: Want to use the tree-based approach.
• Problem: Two scales, Work group and Grid
  • Need to occupy both to make good use of the machine.
  • In particular, need synchronization after each tree stage.
• Solution: Use a two-scale algorithm.

Solution: Kernel Decomposition

Avoid a global sync by decomposing the computation into multiple kernel invocations.
In the case of reductions, the code for all levels is the same: recursive kernel invocation.

[Figure: Level 0 reduces 8 blocks independently (e.g. 3 1 7 0 4 1 6 3 → 4 7 5 9 → 11 14 → 25); Level 1 reduces the 8 per-block results in 1 block]

In particular: Use multiple grid invocations to achieve inter-workgroup synchronization.

With material by M. Harris (Nvidia Corp.)
slide from Berger & Klöckner (NYU 2010)
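A NumPy sketch of the two-scale idea (host-side emulation, not actual kernel code): a first pass reduces each block to a partial sum, a second pass reduces the partial sums; on the GPU each pass would be one grid launch, which supplies the inter-workgroup synchronization.

import numpy as np

def two_level_sum(x, block=256):
    # Level 0: one partial sum per block (each block is independent work).
    n_pad = -(-len(x) // block) * block          # round length up to a multiple of block
    padded = np.zeros(n_pad)
    padded[:len(x)] = x
    partial = padded.reshape(-1, block).sum(axis=1)   # "first kernel launch"
    # Level 1: reduce the partial sums (second launch, or on the host).
    return partial.sum()

x = np.random.rand(10_000)
assert np.isclose(two_level_sum(x), x.sum())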
Interleaved Addressing

Parallel Reduction: Interleaved Addressing
[Figure: 16 values in shared memory reduced in place over four steps with strides 1, 2, 4, 8; the set of active thread IDs halves each step (0, 2, 4, …, 14 → 0, 4, 8, 12 → 0, 8 → 0)]

Issue: Slow modulo, Divergence

With material by M. Harris (Nvidia Corp.)
slide from Berger & Klöckner (NYU 2010)
Sequential Addressing

Parallel Reduction: Sequential Addressing
[Figure: the same 16 shared-memory values reduced over four steps with strides 8, 4, 2, 1; the active thread IDs are contiguous (0–7, then 0–3, 0–1, 0)]

Sequential addressing is conflict free.
Better! But still not “efficient”: only half of all work items are active after the first round, then a quarter, . . .

With material by M. Harris (Nvidia Corp.)
slide from Berger & Klöckner (NYU 2010)
Reduction: Examples
• Sum, Inner Product, Norm
  • Occurs in iterative methods
• Minimum, Maximum
• Data Analysis
  • Evaluation of Monte Carlo Simulations
• List Concatenation, Set Union
• Matrix-Vector product (but. . . )

slide from Berger & Klöckner (NYU 2010)
Reduction: Issues
• When adding: floating point cancellation?
• Serial order goes faster: can use registers for intermediate results
• Requires availability of a neutral element
• GPU-Reduce: Optimization sensitive to data type

slide from Berger & Klöckner (NYU 2010)
Map-Reduce
y = f(· · · f(f(g(x1), g(x2)), g(x3)), . . . , g(xN))   where N is the input size.

• Lisp naming, again
• Mild generalization of reduction

slide from Berger & Klöckner (NYU 2010)
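A one-line sketch of the composition in Python, with an illustrative g (map stage) and f (reduce stage):

from functools import reduce

xs = [1, 2, 3, 4, 5]
g = lambda x: x * x                   # map stage, applied to every element
f = lambda a, b: a + b                # reduce stage, combines the mapped values
y = reduce(f, map(g, xs))             # sum of squares = 55
print(y)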
Map-Reduce: Graph
[Graph: each input x0 … x7 first passes through g, then the g(xi) are combined pairwise in a tree down to y]

slide from Berger & Klöckner (NYU 2010)
MapReduce: Discussion
MapReduce ≥ map + reduce:
• Used by Google (and many others) for large-scale data processing
• Map generates (key, value) pairs
• Reduce operates only on pairs with identical keys
• Remaining output sorted by key
• Represent all data as character strings
  • User must convert to/from internal representation
• Messy implementation
  • Parallelization, fault tolerance, monitoring, data management, load balance, re-running “stragglers”, data locality
• Works for Internet-size data
• Simple to use even for inexperienced users

slide from Berger & Klöckner (NYU 2010)
MapReduce: Examples
• String search
• (e.g. URL) Hit count from Log
• Reverse web-link graph
• desired: (target URL, sources)
• Sort
• Indexing
• desired: (word, document IDs)
• Machine Learning, Clustering, . . .
slide from Berger & Klöckner (NYU 2010)
Scan
y1 = x1
y2 = f(y1, x2)
 ⋮
yN = f(yN−1, xN)

where N is the input size.

• Also called “prefix sum”.
• Or cumulative sum (“cumsum”) in Matlab/NumPy.

slide from Berger & Klöckner (NYU 2010)
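In NumPy terms the pattern is just a cumulative sum; a short sketch of the sequential definition next to the library call:

import numpy as np

x = np.array([3, 1, 7, 0, 4, 1, 6, 3])

# Sequential definition: y1 = x1, y_i = f(y_{i-1}, x_i) with f = +
y = np.empty_like(x)
y[0] = x[0]
for i in range(1, len(x)):
    y[i] = y[i - 1] + x[i]

assert np.array_equal(y, np.cumsum(x))   # inclusive scan / prefix sum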
Scan: Graph
[Graph: a chain in which each yi is computed from yi−1 and xi, then passed on unchanged (Id) to the next stage]

This can’t possibly be parallelized.
Or can it?

slide from Berger & Klöckner (NYU 2010)
Scan: Graph (continued)

Again: Need assumptions on f: associativity, commutativity.

slide from Berger & Klöckner (NYU 2010)
Scan: Implementation
Work-efficient?
slide from Berger & Klöckner (NYU 2010)
Scan: Implementation II
Two sweeps: Upward, downward, both tree-shaped.

On the upward sweep:
• Get values L and R from the left and right child
• Save L in a local variable Mine
• Compute Tmp = L + R and pass it to the parent

On the downward sweep:
• Get value Tmp from the parent
• Send Tmp to the left child
• Send Tmp + Mine to the right child

Work-efficient?
Span relative to the first attempt?

slide from Berger & Klöckner (NYU 2010)
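A Python sketch of the same work-efficient scan in its standard array form (up-sweep then down-sweep over a power-of-two input), producing an exclusive prefix sum; a GPU version would parallelize each tree level.

def exclusive_scan(a):
    """Work-efficient exclusive prefix sum; len(a) must be a power of two."""
    x = list(a)
    n = len(x)
    # Up-sweep: build partial sums up the tree.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            x[i + 2 * d - 1] += x[i + d - 1]
        d *= 2
    # Down-sweep: push prefixes back down the tree.
    x[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            t = x[i + d - 1]
            x[i + d - 1] = x[i + 2 * d - 1]
            x[i + 2 * d - 1] += t
        d //= 2
    return x

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))   # [0, 3, 4, 11, 11, 15, 16, 22]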
Scan: Examples
• Anything with a loop-carried dependence
• One row of Gauss-Seidel
• One row of triangular solve
• Segment numbering if boundaries are known
• Low-level building block for many higher-level algorithms
• FIR/IIR Filtering
• G. E. Blelloch: Prefix Sums and their Applications

slide from Berger & Klöckner (NYU 2010)
Scan: Issues
• Subtlety: Inclusive/Exclusive Scan
• Pattern sometimes hard to recognize
  • But shows up surprisingly often
• Need to prove associativity/commutativity
• Useful in implementation: algorithm cascading
  • Do a sequential scan on parts, then parallelize at coarser granularities

slide from Berger & Klöckner (NYU 2010)
Divide and Conquer
yi = fi(x1, . . . , xN)   for i ∈ {1, . . . , M}.

Main purpose: A way of partitioning up fully dependent tasks.

[Graph: the inputs x0 … x7 are recursively split, processed through intermediate values u, v, w, and recombined into the outputs y0 … y7]

Processor allocation?

slide from Berger & Klöckner (NYU 2010)
Divide and Conquer: Examples
• GEMM, TRMM, TRSM, GETRF (LU)
• FFT
• Sorting: Bucket sort, Merge sort
• N-Body problems (Barnes-Hut, FMM)
• Adaptive Integration

More fun with work and span: D&C analysis lecture

slide from Berger & Klöckner (NYU 2010)
Divide and Conquer: Issues
• “No idea how to parallelize that” → Try D&C
• Non-optimal during partition, merge
  • But: Does not matter if the deep levels do heavy enough processing
• Subtle to map to fixed-width machines (e.g. GPUs)
  • Varying data size along the tree
  • Bookkeeping nontrivial for non-2^n sizes
• Side benefit: D&C is generally cache-friendly

slide from Berger & Klöckner (NYU 2010)