[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns


Transcript of [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Page 1: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Lecture #2: Architecture, Theory & Patterns | February 1st, 2011

Nicolas Pinto (MIT, Harvard) [email protected]

Massively Parallel Computing
CS 264 / CSCI E-292

Page 2: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Objectives

• introduce important computational thinking skills for massively parallel computing

• understand hardware limitations

• understand algorithm constraints

• identify common patterns

Page 3: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

During this course, we'll try to ... and use existing material ;-)

adapted for CS264

Page 4: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
Page 5: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Outline

• Thinking Parallel

• Architecture

• Programming Model

• Bits of Theory

• Patterns

Page 6: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

!"#$$%&$'()*+,-.(/$$0

1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86

/E#E/$$0

F

!"#$%&"'()'*+&+,,",'-./0%$123

!"#"$$%$&'()*+,-./&-0&"&1(#)&(1&'()*+,-./&-.&

23-'3&)".4&-.0,#+',-(.0&"#%&'"##-%5&(+,&

0-)+$,".%(+0$441510"61+

! 7&+61$1.2+,,8)',+&3"9'":0"2;1<"9';0"#1+,1="6

! >:.$1#'?%0"&#./0%$"&;'@>3)'-&+8A

! B1;$&1C%$"6'?8;$"/;'@>3)'D?-E'4F1$"9'G,%"H"2"A'

! *+&+,,",'#./0%$123'I+;'$&+61$1.2+,,8'

12+##";;1C,"'$.'$F"'#.//.61$8'/+&5"$0,+#"

!"#$%&'()*$+,-.%/'0%(,1,(2(%&'()'1$1-%&'3-3%#43%

-.'#%"0%5&",&"&#",%&(1&#(+/3$4&"&1"',(#&(1&,2(&*%#&4%"#&666&

7%#,"-.$4&(8%#&,3%&03(#,&,%#)&,3-0&#",%&'".&9%&%:*%',%5&,(&

'(.,-.+%;&-1&.(,&,(&-.'#%"0%6&<8%#&,3%&$(./%#&,%#);&,3%&

#",%&(1&-.'#%"0%&-0&"&9-,&)(#%&+.'%#,"-.;&"$,3(+/3&,3%#%&-0&

.(&#%"0(.&,(&9%$-%8%&-,&2-$$&.(,&#%)"-.&.%"#$4&'(.0,".,&1(#&

",&$%"0,&=>&4%"#06&?3",&)%".0&94&=@AB;&,3%&.+)9%#&(1&

'()*(.%.,0&*%#&-.,%/#",%5&'-#'+-,&1(#&)-.-)+)&'(0,&2-$$&

9%&CB;>>>6&D&9%$-%8%&,3",&0+'3&"&$"#/%&'-#'+-,&'".&9%&9+-$,&

'1%4%3,15*$%64/$07

H.&6.2'J..&"9'>,"#$&.21#;'J+3+=12"9'KL'D0&1,'KLMN

! 7F"'/.;$'"#.2./1#'2%/C"&'.O'#./0.2"2$;'

12'+2'E-'I1,,'6.%C,"'"<"&8'8"+&

! P1;$.&1#+,,8'!-*Q;'3"$'O+;$"&

"P+&6I+&"'&"+#F123'O&"R%"2#8',1/1$+$1.2;

! S.I'!-*Q;'3"$'I16"&

! T+$F"&'$F+2'":0"#$123'-*Q;'$.'3"$'$I1#"'+;'

O+;$9'":0"#$'$.'F+<"'$I1#"'+;'/+28U

! *+&+,,",'0&.#";;123'O.&'$F"'/+;;";

! Q2O.&$%2+$",8)'*+&+,,",'0&.3&+//123'1;'F+&6V''

"D,3.&1$F/;'+26'B+$+'?$&%#$%&";'/%;$'C"'O%26+/"2$+,,8'&"6";132"6

slide by Matthew Bolitho

Motivation


Page 8: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Thinking Parallel

Page 9: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Getting your feet wet

• Common scenario: “I want to make the algorithm X run faster, help me!”

• Q: How do you approach the problem?

Page 10: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

How?

Page 11: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
Page 12: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

How?

• Option 1: wait

• Option 2: gcc -O3 -msse4.2

• Option 3: xlc -O5

• Option 4: use parallel libraries (e.g. (cu)blas)

• Option 5: hand-optimize everything!

• Option 6: wait more

Page 13: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

What else ?

Page 14: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

How about analysis ?

Page 15: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Getting your feet wet

Algorithm X v1.0 Profiling Analysis on Input 10x10x10, time (s):

load_data(): 50   foo(): 11   bar(): 10   yey(): 29

load_data() is sequential in nature; foo(), bar(), yey() are 100% parallelizable.

Q: What is the maximum speed up ?

Page 16: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Getting your feet wet

Algorithm X v1.0 Profiling Analysis on Input 10x10x10, time (s):

load_data(): 50   foo(): 11   bar(): 10   yey(): 29

load_data() is sequential in nature; foo(), bar(), yey() are 100% parallelizable.

A: 2X ! :-(
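The 2X bound follows directly from the profile above (this arithmetic is not spelled out on the slide; it is the usual serial-fraction argument):

total time T = 50 + 11 + 10 + 29 = 100 s
serial part (load_data) = 50 s, perfectly parallelizable part = 50 s
speedup on P workers <= T / (50 + 50/P), which tends to 100 / 50 = 2 as P grows.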

Page 17: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Getting your feet wet

Algorithm X v1.0 Profiling Analysis on Input 100x100x100, time (s):

load_data(): 9,000   foo(): 300   bar(): 250   yey(): 350

load_data() is sequential in nature; foo(), bar(), yey() are 100% parallelizable.

Q: and now?

Page 18: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

You need to...

• ... understand the problem (duh!)

• ... study the current (sequential?) solutions and their constraints

• ... know the input domain

• ... profile accordingly (a minimal timing sketch follows this list)

• ... “refactor” based on new constraints (hw/sw)
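A minimal timing sketch for the "profile accordingly" step, in plain C. Assumptions: load_data(), foo(), bar() and yey() are the routines named in the profiling slides above and are linked in from your own code; clock() measures CPU time, which is good enough for second-scale phases.

#include <stdio.h>
#include <time.h>

void load_data(void);   /* your implementations, linked in separately */
void foo(void);
void bar(void);
void yey(void);

static double seconds(void) { return (double) clock() / CLOCKS_PER_SEC; }

int main(void)
{
    double t0 = seconds(); load_data();
    double t1 = seconds(); foo();
    double t2 = seconds(); bar();
    double t3 = seconds(); yey();
    double t4 = seconds();

    printf("load_data %.1f s  foo %.1f s  bar %.1f s  yey %.1f s\n",
           t1 - t0, t2 - t1, t3 - t2, t4 - t3);
    return 0;
}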

Page 19: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

A better way ?

Speculation: (input) domain-aware optimization using some sort of probabilistic modeling ?

...

doesn’t scale !

Page 20: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Some Perspective

The "problem tree" for scientific problem solving: there are many options to try to achieve the same goal. A technical problem to be analyzed can be approached through different scientific models ("A", "B"), informed by consultation with experts, experiments, and theoretical analysis; each model admits different discretizations ("A", "B"), different equation solvers (direct elimination vs. iterative), and sequential or parallel implementations.

Figure 11: The "problem tree" for scientific problem solving. There are many options to try to achieve the same goal.

from Scott et al. "Scientific Parallel Computing" (2005)

Page 21: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Computational Thinking

• translate/formulate domain problems into computational models that can be solved efficiently by available computing resources

• requires a deep understanding of their relationships

adapted from Hwu & Kirk (PASI 2011)

Page 22: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Getting ready...

Parallel Computing sits at the intersection of Architecture, Algorithms, and Languages, in service of APPLICATIONS; for CS264 add Compilers, Patterns, Programming Models, and Parallel Thinking.

Figure 3: Knowledge of algorithms, architecture, and languages contributes to effective use of parallel computers in practical applications.

adapted from Scott et al. "Scientific Parallel Computing" (2005)

Page 23: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Fundamental Skills

• Computer architecture

• Programming models and compilers

• Algorithm techniques and patterns

• Domain knowledge

Page 24: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Computer Architecture

• memory organization, bandwidth and latency; caching and locality (memory hierarchy)

• floating-point precision vs. accuracy (see the short example below)

• SISD, SIMD, MISD, MIMD vs. SIMT, SPMD

critical in understanding tradeoffs between algorithms
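A short illustration of the precision vs. accuracy point above (not from the slides): the same accumulation carried out in float and in double.

#include <stdio.h>

int main(void)
{
    float  sum_f = 0.0f;
    double sum_d = 0.0;

    /* add 0.1 ten million times; the exact answer is 1,000,000 */
    for (int i = 0; i < 10000000; ++i) {
        sum_f += 0.1f;
        sum_d += 0.1;
    }

    printf("float : %f\n", sum_f);   /* drifts noticeably from 1000000 */
    printf("double: %f\n", sum_d);   /* stays much closer */
    return 0;
}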

Page 25: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Programming models

• parallel execution models (threading hierarchy)

• optimal memory access patterns

• array data layout and loop transformations

for optimal data structure and code execution

Page 26: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Algorithms and patterns

• toolbox for designing good parallel algorithms

• it is critical to understand their scalability and efficiency

• many have been exposed and documented

• sometimes hard to “extract”

• ... but keep trying!

Page 27: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Domain Knowledge

• abstract modeling

• mathematical properties

• accuracy requirements

• coming back to the drawing board to expose more/better parallelism ?

Page 28: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

You can do it!

• thinking parallel is not as hard as you may think

• many techniques have been thoroughly explained...

• ... and are now “accessible” to non-experts !

Page 29: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Architecture

Page 30: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Architecture

• What’s in a (basic) computer?

• Basic Subsystems

• Machine Language

• Memory Hierarchy

• Pipelines

• CPUs to GPUs


Page 32: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

What’s in a computer?

Processor

Intel Q6600 Core2 Quad, 2.4 GHz

Die

(2×) 143 mm2, 2× 2 cores

582,000,000 transistors

∼ 100W

Memory

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)


Page 37: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Architecture

• What’s in a (basic) computer?

• Basic Subsystems

• Machine Language

• Memory Hierarchy

• Pipelines

Page 38: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

A Basic Processor

(diagram of a basic processor, loosely based on Intel 8086: an Internal Bus connecting the Register File and Flags, a Data ALU, an Address ALU, the Control Unit with its PC, and a Memory Interface with instruction fetch, leading out to the Data Bus and Address Bus)

Bonus Question: What's a bus?

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 39: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

How all of this fits together

Everything synchronizes to the Clock.

Control Unit ("CU"): The brains of the operation. Everything connects to it.

Bus entries/exits are gated and (potentially) buffered.

CU controls gates, tells other units about 'what' and 'how':

• What operation?

• Which register?

• Which addressing mode?

(same processor diagram as before: Internal Bus, Register File, Flags, Data ALU, Address ALU, Control Unit, PC, Memory Interface, instruction fetch, Data Bus, Address Bus)

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 40: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

What is. . . an ALU?

Arithmetic Logic Unit

One or two operands A, B

Operation selector (Op):

• (Integer) Addition, Subtraction

• (Logical) And, Or, Not

• (Bitwise) Shifts (equivalent to

multiplication by power of two)

• (Integer) Multiplication, Division

Specialized ALUs:

• Floating Point Unit (FPU)

• Address ALU

Operates on binary representations of

numbers. Negative numbers represented

by two’s complement.

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 41: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

What is. . . a Register File?

Registers are On-Chip Memory

• Directly usable as operands in

Machine Language

• Often “general-purpose”

• Sometimes special-purpose: Floating

point, Indexing, Accumulator

• Small: x86-64 has 16 × 64-bit GPRs

• Very fast (near-zero latency)

(register file: %r0, %r1, ..., %r7)

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 42: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

How does computer memory work?

One (reading) memory transaction (simplified):

Processor Memory

CLK

R/W

A0..15

D0..15

Observation: Access (and addressing) happens

in bus-width-size “chunks”.

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)


Page 48: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

How does computer memory work?

One (reading) memory transaction (simplified):

Processor Memory

CLK

R/W

A0..15

D0..15

Observation: Access (and addressing) happens

in bus-width-size “chunks”.

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 49: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

What is. . . a Memory Interface?

Memory Interface gets and stores binary words in off-chip memory.

Smallest granularity: Bus width

Tells outside memory

• “where” through address bus

• “what” through data bus

Computer main memory is "Dynamic RAM" (DRAM): Slow, but small and cheap.

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 50: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Architecture

• What’s in a (basic) computer?

• Basic Subsystems

• Machine Language

• Memory Hierarchy

• Pipelines

• CPUs to GPUs

Page 51: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

A Very Simple Program

int a = 5;
int b = 17;
int z = a * b;

4:  c7 45 f4 05 00 00 00    movl $0x5,-0xc(%rbp)
b:  c7 45 f8 11 00 00 00    movl $0x11,-0x8(%rbp)
12: 8b 45 f4                mov  -0xc(%rbp),%eax
15: 0f af 45 f8             imul -0x8(%rbp),%eax
19: 89 45 fc                mov  %eax,-0x4(%rbp)
1c: 8b 45 fc                mov  -0x4(%rbp),%eax

Things to know:

• Addressing modes (Immediate, Register, Base plus Offset)

• 0xHexadecimal

• “AT&T Form”: (we’ll use this)<opcode><size> <source>, <dest>

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 52: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

A Very Simple Program: Intel Form

4: c7 45 f4 05 00 00 00 mov DWORD PTR [rbp−0xc],0x5

b: c7 45 f8 11 00 00 00 mov DWORD PTR [rbp−0x8],0x11

12: 8b 45 f4 mov eax,DWORD PTR [rbp−0xc]

15: 0f af 45 f8 imul eax,DWORD PTR [rbp−0x8]

19: 89 45 fc mov DWORD PTR [rbp−0x4],eax

1c: 8b 45 fc mov eax,DWORD PTR [rbp−0x4]

• "Intel Form" (you might see this on the net): <opcode> <sized dest>, <sized source>

• Goal: Reading comprehension.

• Don't understand an opcode? Google "<opcode> intel instruction".

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 53: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Machine Language Loops

int main()
{
    int y = 0, i;
    for (i = 0; y < 10; ++i)
        y += i;
    return y;
}

0:  55                       push %rbp
1:  48 89 e5                 mov  %rsp,%rbp
4:  c7 45 f8 00 00 00 00     movl $0x0,-0x8(%rbp)
b:  c7 45 fc 00 00 00 00     movl $0x0,-0x4(%rbp)
12: eb 0a                    jmp  1e <main+0x1e>
14: 8b 45 fc                 mov  -0x4(%rbp),%eax
17: 01 45 f8                 add  %eax,-0x8(%rbp)
1a: 83 45 fc 01              addl $0x1,-0x4(%rbp)
1e: 83 7d f8 09              cmpl $0x9,-0x8(%rbp)
22: 7e f0                    jle  14 <main+0x14>
24: 8b 45 f8                 mov  -0x8(%rbp),%eax
27: c9                       leaveq
28: c3                       retq

Things to know:

• Condition Codes (Flags): Zero, Sign, Carry, etc.

• Call Stack: Stack frame, stack pointer, base pointer

• ABI: Calling conventions

Want to make those yourself? Write myprogram.c.
$ cc -c myprogram.c
$ objdump --disassemble myprogram.o

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)


Page 55: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

We know how a computer works!

All of this can be built in about 4000 transistors.

(e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600)

So what exactly is Intel doing with the other 581,996,000

transistors?

Answer:

Make things go faster!

Goal now: Understand sources of slowness, and how they get addressed.

Remember: High Performance Computing

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)


Page 58: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

The High-Performance Mindset

Writing high-performance codes

Mindset: What is going to be the limiting factor?

• ALU?

• Memory?

• Communication? (if multi-machine)

Benchmark the assumed limiting factor right away.

Evaluate

• Know your peak throughputs (roughly)

• Are you getting close?

• Are you tracking the right limiting factor?

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 59: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Architecture

• What’s in a (basic) computer?

• Basic Subsystems

• Machine Language

• Memory Hierarchy

• Pipelines

• CPUs to GPUs

Page 60: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Source of Slowness: Memory

Memory is slow.

Distinguish two different versions of "slow":
• Bandwidth
• Latency

→ Memory has long latency, but can have large bandwidth.

Size of die vs. distance to memory: big!

Dynamic RAM: long intrinsic latency!

Idea: Put a look-up table of recently-used data onto the chip.

→ "Cache"

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
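A small experiment that makes the cache idea concrete (a sketch, not from the slides): summing the same 2-D array row-by-row (contiguous in memory) and column-by-column (strided), and timing both.

#include <stdio.h>
#include <time.h>

#define N 2048
static double grid[N][N];        /* 32 MB, statically allocated */

static double sum_rows(void)     /* unit stride: cache friendly */
{
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += grid[i][j];
    return s;
}

static double sum_cols(void)     /* stride of N doubles: cache hostile */
{
    double s = 0.0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += grid[i][j];
    return s;
}

int main(void)
{
    clock_t t0 = clock();  double a = sum_rows();
    clock_t t1 = clock();  double b = sum_cols();
    clock_t t2 = clock();

    printf("rows: %.3f s  cols: %.3f s  (sums %g %g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, a, b);
    return 0;
}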


Page 62: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

The Memory Hierarchy

Hierarchy of increasingly bigger, slower memories:

Registers: 1 kB, 1 cycle

L1 Cache: 10 kB, 10 cycles

L2 Cache: 1 MB, 100 cycles

DRAM: 1 GB, 1000 cycles

Virtual Memory (hard drive): 1 TB, 1 M cycles

(faster toward the top of the hierarchy, bigger toward the bottom)

How might data locality factor into this?

What is a working set?

Intro Basics Assembly Memory Pipelines, adapted from Berger & Klöckner (NYU 2010)
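One common way to turn the per-level latencies above into a single number is the average memory access time. Only the latencies come from the slide; the hit rates below are made-up illustrative values:

AMAT ≈ t_L1 + miss_L1 × (t_L2 + miss_L2 × t_DRAM)
     ≈ 10 + 0.05 × (100 + 0.10 × 1000)        (assuming 95% L1 hits and 90% L2 hits)
     = 10 + 0.05 × 200
     = 20 cycles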

Page 63: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

(figure: performance of a computer system vs. size of the problem being solved; performance steps down as the entire problem fits within registers, then within cache, then within main memory, then requires secondary (disk) memory, and finally becomes too big for the system)

Figure 6: Hypothetical model of performance of a computer having a hierarchy of memory systems (registers, cache, main memory, and disk).

from Scott et al. “Scientific Parallel Computing” (2005)

Impact on Performance


Page 65: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Cache: Actual Implementation

Demands on cache implementation:

• Fast, small, cheap, low-power

• Fine-grained

• High “hit”-rate (few “misses”)

Problem: Goals at odds with each other: Access matching logic expensive!

Solution 1: More data per unit of access matching logic → Larger "Cache Lines"

Solution 2: Simpler/less access matching logic → Less than full "Associativity"

Other choices: Eviction strategy, size

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 66: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Cache: Associativity

(figure: a memory with blocks 0, 1, 2, 3, 4, 5, 6, ... mapping into a 4-entry cache, shown both Direct Mapped and 2-way set associative)

Miss rate versus cache size on the Integer portion of SPEC CPU2000 [Cantin, Hill 2003]

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)


Page 68: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Cache Example: Intel Q6600/Core2 Quad

--- L1 data cache ---
fully associative cache     = false
threads sharing this cache  = 0x0 (0)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0x7 (7)
number of sets - 1 (s)      = 63

--- L1 instruction ---
fully associative cache     = false
threads sharing this cache  = 0x0 (0)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0x7 (7)
number of sets - 1 (s)      = 63

--- L2 unified cache ---
fully associative cache     = false
threads sharing this cache  = 0x1 (1)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0xf (15)
number of sets - 1 (s)      = 4095

More than you care to know about your CPU: http://www.etallen.com/cpuid.html

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
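As a sanity check (not on the slide): these cpuid fields are encoded as "value minus one", and multiplying them back out reproduces the Q6600's cache sizes:

L1 data:  (63 + 1) sets × (7 + 1) ways × (63 + 1) B per line = 64 × 8 × 64 B = 32 KB per core
L2:       (4095 + 1) sets × (15 + 1) ways × (63 + 1) B per line = 4096 × 16 × 64 B = 4 MB, shared by two cores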

Page 69: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Measuring the Cache I

void go(unsigned count, unsigned stride)
{
    const unsigned arr_size = 64 * 1024 * 1024;
    int *ary = (int *) malloc(sizeof(int) * arr_size);

    for (unsigned it = 0; it < count; ++it)
    {
        for (unsigned i = 0; i < arr_size; i += stride)
            ary[i] *= 17;
    }

    free(ary);
}

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
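A minimal driver for the go() routine above, so the effect of the stride can actually be measured (this harness is not from the slides; compile it together with go(), which itself needs <stdlib.h> for malloc/free):

#include <stdio.h>
#include <time.h>

void go(unsigned count, unsigned stride);   /* the function on the slide */

int main(void)
{
    const unsigned strides[] = { 1, 2, 4, 8, 16, 32, 64, 128 };

    for (unsigned i = 0; i < sizeof strides / sizeof strides[0]; ++i) {
        clock_t t0 = clock();
        go(1, strides[i]);
        printf("stride %3u: %.3f s\n", strides[i],
               (double)(clock() - t0) / CLOCKS_PER_SEC);
    }
    return 0;
}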


Page 71: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Measuring the Cache II

void go(unsigned array_size, unsigned steps)
{
    int *ary = (int *) malloc(sizeof(int) * array_size);
    unsigned asm1 = array_size - 1;

    for (unsigned i = 0; i < steps; ++i)
        ary[(i * 16) & asm1]++;

    free(ary);
}

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)


Page 73: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Measuring the Cache III

void go(unsigned array_size, unsigned stride, unsigned steps)
{
    char *ary = (char *) malloc(sizeof(int) * array_size);

    unsigned p = 0;
    for (unsigned i = 0; i < steps; ++i)
    {
        ary[p]++;
        p += stride;
        if (p >= array_size)
            p = 0;
    }

    free(ary);
}

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)


Page 75: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Mike Bauer (Stanford)

Page 76: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Tue 4/5/11: Guest Lecture by Mike Bauer (Stanford)

http://sequoia.stanford.edu/

Page 77: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Architecture

• What’s in a (basic) computer?

• Basic Subsystems

• Machine Language

• Memory Hierarchy

• Pipelines

• CPUs to GPUs

Page 78: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Source of Slowness: Sequential Operation

IF Instruction fetch

ID Instruction Decode

EX Execution

MEM Memory Read/Write

WB Result Writeback

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 79: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Solution: Pipelining

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 80: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Pipelining

(MIPS, 110,000 transistors)

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 81: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Issues with Pipelines

Pipelines generally help performance, but not always.

Possible issues:

• Stalls

• Dependent Instructions

• Branches (+Prediction)

• Self-Modifying Code

"Solution": Bubbling, extra circuitry

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 82: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intel Q6600 Pipeline

New concept: Instruction-level parallelism ("Superscalar")

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)


Page 84: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Programming for the Pipeline

How to upset a processor pipeline:

for ( int i = 0; i < 1000; ++i)

for ( int j = 0; j < 1000; ++j)

{if ( j % 2 == 0)

do something(i , j );

}

. . . why is this bad?

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
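One way to fix it (a sketch, not from the slide): since only the even j values do any work, stepping j by 2 removes both the per-iteration test and the half of the iterations that did nothing.

#include <stdio.h>

static long acc;
static void do_something(int i, int j) { acc += i ^ j; }   /* stand-in body for the slide's do_something() */

int main(void)
{
    for (int i = 0; i < 1000; ++i)
        for (int j = 0; j < 1000; j += 2)    /* j is always even: no if inside the loop */
            do_something(i, j);

    printf("%ld\n", acc);
    return 0;
}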

Page 85: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

A Puzzle

int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// Loop 1
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }

Which is faster?

. . . and why?

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
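The snippet above is written in C#/Java style; here is a C version you can compile and time yourself (a sketch, not from the slide; volatile keeps the compiler from optimizing the loops away):

#include <stdio.h>
#include <time.h>

int main(void)
{
    const long steps = 256L * 1024 * 1024;
    volatile int a[2] = { 0, 0 };

    clock_t t0 = clock();
    for (long i = 0; i < steps; i++) { a[0]++; a[0]++; }   /* Loop 1 */
    clock_t t1 = clock();
    for (long i = 0; i < steps; i++) { a[0]++; a[1]++; }   /* Loop 2 */
    clock_t t2 = clock();

    printf("loop 1: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("loop 2: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}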

Page 86: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Two useful Strategies

Loop unrolling:

for (int i = 0; i < 1000; ++i)
    do_something(i);

→

for (int i = 0; i < 1000; i += 2)
{
    do_something(i);
    do_something(i + 1);
}

Software pipelining:

for (int i = 0; i < 1000; ++i)
{
    do_a(i);
    do_b(i);
}

→

for (int i = 0; i < 1000; i += 2)
{
    do_a(i);
    do_a(i + 1);
    do_b(i);
    do_b(i + 1);
}

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)

Page 87: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

SIMD

Control Units are large and expensive.

Functional Units are simple and cheap.

→ Increase the Function/Control ratio:

Control several functional units with

one control unit.

All execute same operation.

(figure: SIMD, one Instruction Pool driving several execution units over a Data Pool)

GCC vector extensions:

typedef int v4si __attribute__ ((vector_size (16)));

v4si a, b, c;

c = a + b;

// +, -, *, /, unary minus, ^, |, &, ~, %

Will revisit for OpenCL, GPUs.

Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
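A complete little program using the extension above (assumes a reasonably recent GCC or Clang, which also allow indexing vector elements directly):

#include <stdio.h>

typedef int v4si __attribute__ ((vector_size (16)));

int main(void)
{
    v4si a = { 1, 2, 3, 4 };
    v4si b = { 10, 20, 30, 40 };
    v4si c = a + b;              /* one SIMD addition over four ints */

    for (int i = 0; i < 4; ++i)
        printf("%d ", c[i]);
    printf("\n");
    return 0;
}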

Page 88: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Architecture

• What’s in a (basic) computer?

• Basic Subsystems

• Machine Language

• Memory Hierarchy

• Pipelines

• CPUs to GPUs

Page 89: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

!"#$$%&$'()*+,-.(/$$0

1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86

EFG$F/$$0

&

! !"#$%&$'()*(+,-.'/

!012('&.*2(3'45&*)&6,7'&"2'89':&%;<=&;>6?&;*2(4'& !012('&.*2(3'45&*)&6,7'&"2'89':&%;<=&;>6?&;*2(4'&

! 6'401-'@&)*(&+,3AB0-3'-407':&C,(,DD'D&

C(*8D'+4/

! E*('&3(,-4043*(4&@'@0.,3'@&3*&?">&3A,-&)D*F&

.*-3(*D&,-@&@,3,&.,.A'

! GA,3&,('&3A'&.*-4'H2'-.'4I

! GA,3&,('&3A'&.*-4'H2'-.'4I

! $(*1(,+&+243&8'&+*('&C('@0.3,8D'/

! 6,3,&,..'44&.*A'('-.5

! $(*1(,+&)D*F

GPUs ?

Page 90: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

“CPU-style” Cores

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

CPU-“style” cores

ALU (Execute)

Fetch/ Decode

Execution Context

Out-of-order control logic

Fancy branch predictor

Memory pre-fetcher

Data cache (A big one)


Credit: Kayvon Fatahalian (Stanford)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 91: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Slimming down

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Slimming down

ALU (Execute)

Fetch/ Decode

Execution Context

Idea #1:

Remove components that help a single instruction stream run fast


Credit: Kayvon Fatahalian (Stanford)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 92: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

More Space: Double the Number of Cores

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Two cores (two fragments in parallel)

ALU (Execute)

Fetch/ Decode

Execution Context

ALU (Execute)

Fetch/ Decode

Execution Context

!"#$$%&'()*"'+,-.

&*/01'.+23.453.623.&2.

/%1..+73.423.892:2;.

/*"".+73.4<3.892:<;3.+7.

/*"".+73.4=3.892:=;3.+7.

81/0.+73.+73.1>[email protected]><?2@.

/%1..A23.+23.+7.

/%1..A<3.+<3.+7.

/%1..A=3.+=3.+7.

/A4..A73.1><?2@.

fragment 1

!"#$$%&'()*"'+,-.

&*/01'.+23.453.623.&2.

/%1..+73.423.892:2;.

/*"".+73.4<3.892:<;3.+7.

/*"".+73.4=3.892:=;3.+7.

81/0.+73.+73.1>[email protected]><?2@.

/%1..A23.+23.+7.

/%1..A<3.+<3.+7.

/%1..A=3.+=3.+7.

/A4..A73.1><?2@.

fragment 2

15

Credit: Kayvon Fatahalian (Stanford)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 93: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

. . . again

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Four cores (four fragments in parallel)

ALU (Execute)

Fetch/ Decode

Execution Context

ALU (Execute)

Fetch/ Decode

Execution Context

ALU (Execute)

Fetch/ Decode

Execution Context

ALU (Execute)

Fetch/ Decode

Execution Context


Credit: Kayvon Fatahalian (Stanford)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 94: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

. . . and again

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Sixteen cores (sixteen fragments in parallel)

(figure: sixteen slimmed-down cores, each with its own ALU)

16 cores = 16 simultaneous instruction streams

Credit: Kayvon Fatahalian (Stanford)

→ 16 independent instruction streams

Reality: instruction streams not actually very different/independent

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by


Page 96: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Saving Yet More Space

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Recall: simple processing core

Fetch/ Decode

ALU (Execute)

Execution Context

Idea #2

Amortize cost/complexity of managing an instruction stream across many ALUs

→ SIMD

Credit: Kayvon Fatahalian (Stanford)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by


Page 98: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Saving Yet More Space

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Add ALUs

Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs → SIMD

(figure: one Fetch/Decode unit feeding ALU 1 ... ALU 8, each with a small Ctx, plus Shared Ctx Data: SIMD processing)

Credit: Kayvon Fatahalian (Stanford)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by


Page 100: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Gratuitous Amounts of Parallelism!

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

128 fragments in parallel

16 cores = 128 ALUs = 16 simultaneous instruction streams

Credit: Kayvon Fatahalian (Stanford)

Example:

128 instruction streams in parallel: 16 independent groups of 8 synchronized streams

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

http://www.youtube.com/watch?v=1yH_j8-VVLo


Page 102: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Remaining Problem: Slow Memory

Problem

Memory still has very high latency . . . but we've removed most of the hardware that helps us deal with that.

We’ve removed

caches

branch prediction

out-of-order execution

So what now?

Idea #3

Even more parallelism + Some extra memory = A solution!

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by


Page 104: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Remaining Problem: Slow Memory

Problem

Memory still has very high latency . . . but we've removed most of the hardware that helps us deal with that.

We’ve removed

caches

branch prediction

out-of-order execution

So what now?

(figure: "Hiding shader stalls", time in clocks on the vertical axis; Frag 1 … 8 occupy one core's Fetch/Decode, ALUs, and Shared Ctx Data)

Idea #3

Even more parallelism + Some extra memory = A solution!

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 105: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Remaining Problem: Slow Memory

Problem

Memory still has very high latency . . . but we've removed most of the hardware that helps us deal with that.

We’ve removed

caches

branch prediction

out-of-order execution

So what now?

(figure: "Hiding shader stalls", time in clocks; the core interleaves four groups, Frag 1 … 8, Frag 9 … 16, Frag 17 … 24, Frag 25 … 32, so the ALUs stay busy while each group waits on memory)

Idea #3

Even more parallelism + Some extra memory = A solution!

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 106: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

GPU Architecture Summary

Core Ideas:

1. Many slimmed-down cores → lots of parallelism

2. More ALUs, fewer Control Units

3. Avoid memory stalls by interleaving execution of SIMD groups ("warps")

Credit: Kayvon Fatahalian (Stanford)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 107: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

!"#$$%&$'()*+,-.(/$$0

1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86

EFG$F/$$0

&

! !"#$%&$'()*(+,-.'/

!012('&.*2(3'45&*)&6,7'&"2'89':&%;<=&;>6?&;*2(4'& !012('&.*2(3'45&*)&6,7'&"2'89':&%;<=&;>6?&;*2(4'&

! 6'401-'@&)*(&+,3AB0-3'-407':&C,(,DD'D&

C(*8D'+4/

! E*('&3(,-4043*(4&@'@0.,3'@&3*&?">&3A,-&)D*F&

.*-3(*D&,-@&@,3,&.,.A'

! GA,3&,('&3A'&.*-4'H2'-.'4I

! GA,3&,('&3A'&.*-4'H2'-.'4I

! $(*1(,+&+243&8'&+*('&C('@0.3,8D'/

! 6,3,&,..'44&.*A'('-.5

! $(*1(,+&)D*F

slide by Matthew Bolitho

Is it free?

Page 108: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Some terminology

One way to classify machines distinguishes between:

shared memory: global memory can be accessed by all processors or cores. Information is exchanged between threads using shared variables written by one thread and read by another. Need to coordinate access to shared variables.

distributed memory: private memory for each processor, only accessible to this processor, so no synchronization for memory accesses is needed. Information is exchanged by sending data from one processor to another via an interconnection network using explicit communication operations.

(diagrams: "shared memory" shows processors P reaching memories M through a common Interconnection Network; "distributed memory" shows each processor P paired with its own memory M, with the Interconnection Network between the nodes)

Hybrid approach increasingly common; now: mostly hybrid


Page 110: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Programming Model (Overview)

Page 111: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

GPU Architecture

CUDA Programming Model

Page 112: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Connection: Hardware ↔ Programming Model

(figure: ten slimmed-down cores, each with its own Fetch/Decode unit, 32 kiB of private Ctx ("Registers"), and 16 kiB of shared Ctx)

Who cares how many cores?

Idea:

• Program as if there were "infinitely" many cores

• Program as if there were "infinitely" many ALUs per core

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Software representation (how does it map onto the hardware?):

• Grid (Kernel: Function on Grid), with Axis 0, Axis 1, ...

• (Work) Group, or "Block"

• (Work) Item, or "Thread"

Really: Group provides pool of parallelism to draw from.

X, Y, Z order within group matters. (Not among groups, though.)

Grids can be 1, 2, 3-dimensional.

slide by Andreas Klockner, GPU-Python with PyOpenCL and PyCUDA


Page 120: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares ho

w

manycore

s?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axis1

HardwareSoftware representation

?

Really: Group providespool of parallelism to drawfrom.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 121: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares ho

w

manycore

s?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axis1

HardwareSoftware representation

?

Really: Group providespool of parallelism to drawfrom.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 122: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares ho

w

manycore

s?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axis1

HardwareSoftware representation

?

Really: Group providespool of parallelism to drawfrom.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 123: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares ho

w

manycore

s?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axis1

HardwareSoftware representation

?

Really: Group providespool of parallelism to drawfrom.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 124: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares ho

w

manycore

s?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axis1

HardwareSoftware representation

?

Really: Group providespool of parallelism to drawfrom.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 125: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares ho

w

manycore

s?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axis1

HardwareSoftware representation

?

Really: Group providespool of parallelism to drawfrom.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 126: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares ho

w

manycore

s?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axis1

HardwareSoftware representation

?

Really: Group providespool of parallelism to drawfrom.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 127: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares ho

w

manycore

s?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axis1

HardwareSoftware representation

?

Really: Group providespool of parallelism to drawfrom.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 128: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares ho

w

manycore

s?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axis1

HardwareSoftware representation

?

Really: Group providespool of parallelism to drawfrom.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Block

block

Page 129: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Intro PyOpenCL What and Why? OpenCL

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares ho

w

manycore

s?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axis1

HardwareSoftware representation

?

Really: Group providespool of parallelism to drawfrom.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by

Page 130: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

more next time ;-)

Page 131: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Bits of Theory (or "common sense")

Page 132: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Speedup

• T(1): execution time of the best serial algorithm

• p: Number of processors

S(p) = T(1) / T(p)

S(p) ≤ p

Peter Arbenz, Andreas Adelmann, ETH Zurich

Page 133: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Efficiency

• Fraction of time for which a processor does useful work

E(p) = S(p) / p = T(1) / (p · T(p))

• S(p) ≤ p means E(p) ≤ 1

Peter Arbenz, Andreas Adelmann, ETH Zurich
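As a quick illustration with made-up numbers (not from the slides): if the best serial code takes T(1) = 100 s and the parallel code on p = 4 processors takes T(4) = 30 s, then S(4) = 100/30 ≈ 3.3 and E(4) = S(4)/4 ≈ 0.83, i.e. each processor does useful work about 83% of the time.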

Page 134: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Amdahl’s Law

Peter Arbenz, Andreas Adelmann, ETH Zurich

• α: fraction of the program that is sequential

• Assumes that the non-sequential portion of the program parallelizes optimally

T(p) = ( α + (1 − α)/p ) · T(1)

Page 135: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Example

• Sequential portion: 10 sec

• Parallel portion: 990 sec

• What is the maximal speedup as p → ∞?

Page 136: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Solution

• Sequential fraction of the code: α = 10 / (10 + 990) = 1/100 = 1%

• Amdahl's Law: T(p) = ( 0.01 + 0.99/p ) · T(1)

• Speedup as p → ∞: S(p) = T(1) / T(p) → 1/α = 100
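The same computation can be checked with a few lines of C (my own illustrative helper, not from the lecture; the function name and the sample processor counts are made up):

#include <stdio.h>

/* Amdahl's Law: speedup on p processors for a program with sequential
 * fraction alpha, assuming the rest parallelizes perfectly. */
static double amdahl_speedup(double alpha, double p) {
    return 1.0 / (alpha + (1.0 - alpha) / p);
}

int main(void) {
    double alpha = 0.01;   /* 10 s sequential out of 1000 s total */
    printf("p = 10     -> S = %.2f\n", amdahl_speedup(alpha, 10.0));
    printf("p = 100    -> S = %.2f\n", amdahl_speedup(alpha, 100.0));
    printf("p = 10000  -> S = %.2f\n", amdahl_speedup(alpha, 10000.0));  /* approaches 1/alpha = 100 */
    return 0;
}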

Page 137: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
Page 138: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Arithmetic Intensity

• W: computational Work, in floating-point operations

• M: number of Memory accesses (reads and writes)

• Arithmetic intensity = W / M

• Memory access is the critical issue!

Page 139: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Example 4.1 (Memory effects)

Memory access is the critical issue in high-performance computing.

Definition 4.2 The work/memory ratio W/M: the number of floating-point operations divided by the number of memory locations referenced (either reads or writes).

A look at a book of mathematical tables tells us that

π/4 = 1 − 1/3 + 1/5 − 1/7 + 1/9 − 1/11 + 1/13 − 1/15 + ···   (4.1)

This slowly converging series is a good example for studying the basic operation of computing the sum of a series of numbers:

A = Σ_{i=1..N} a_i.   (4.2)

Computation of A in equation (4.2) requires N − 1 floating-point additions and involves N + 1 memory locations: one for A and N for the a_i's.

Therefore, the work/memory ratio for this algorithm is W/M = (N − 1)/(N + 1) ≈ 1 for large N.
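To make the W/M ≈ 1 claim concrete, here is a small, self-contained C version of the summation in (4.2) applied to the series (4.1); the code is mine (illustrative, not from the book or the slides). Each loop iteration performs one addition and reads one array element, so the work/memory ratio is about 1.

#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    static double a[N];

    /* terms of the series (4.1): pi/4 = 1 - 1/3 + 1/5 - 1/7 + ... */
    for (int i = 0; i < N; i++)
        a[i] = (i % 2 == 0 ? 1.0 : -1.0) / (2.0 * i + 1.0);

    /* summation as in (4.2): ~N additions over ~N memory locations, so W/M ~ 1 */
    double A = 0.0;
    for (int i = 0; i < N; i++)
        A += a[i];

    printf("pi estimate = %.6f\n", 4.0 * A);
    return 0;
}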

Page 140: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

[Figure 9: "Speed-up of simple Pi summation" - hypothetical performance of a parallel implementation of summation, plotted as speed-up vs. number of processors.]

Why?

from Scott et al. “Scientific Parallel Computing” (2005)

Page 141: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

[Figure 10: "Parallel efficiency of simple Pi summation" - hypothetical performance of a parallel implementation of summation, plotted as parallel efficiency (between 0.5 and 1) vs. number of processors.]

Why?

from Scott et al. “Scientific Parallel Computing” (2005)

Page 142: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Example

[Figure 4: A simple memory model: computation done here → pathway to memory → main data stored here. The computational unit has only a small amount of local memory (not shown) and is separated from the main memory by a pathway with limited bandwidth µ.]

Theorem 4.1 Suppose that a given algorithm has a work/memory ratio W/M, and it is implemented on a system as depicted in Figure 4 with a maximum bandwidth to memory of µ billion floating-point words per second. Then the maximum performance that can be achieved is µ · (W/M) GFLOPS.

Theorem 4.1 provides an upper bound on the number of operations per unit time, by assuming the floating-point operation blocks until data are available to the CPU. Therefore the CPU cannot proceed faster than the rate at which data are supplied, and it might proceed slower.

• Q: How many float32 ops / sec maximum ?

• Processing unit can’t be faster than the rate data are supplied, and it might be slower

Bandwidth = 1 Gbyte / sec
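A back-of-the-envelope answer using the numbers above (my own arithmetic): at 1 Gbyte/s and 4 bytes per float32 word, the pathway delivers at most 10^9 / 4 = 0.25 × 10^9 words per second; with the summation's work/memory ratio W/M ≈ 1, Theorem 4.1 caps the achievable rate at roughly µ · (W/M) ≈ 0.25 GFLOPS, no matter how fast the processing unit itself is.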

Page 143: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Better?

[Figure 5: A memory model with a large local data cache between the computational unit and the main memory, separated from main memory by a pathway with limited bandwidth µ.]

The performance of a two-level memory model (as depicted in Figure 5) consisting of a cache and a main memory can be modeled simplistically as

average cycles per word access = %hits × (cache cycles per word access) + (1 − %hits) × (main memory cycles per word access),   (4.3)

where %hits is the fraction of cache hits among all memory references.

Figure 6 indicates the performance of a hypothetical application, depicting a decrease in performance as a problem increases in size and migrates into ever slower memory systems. Eventually the problem size reaches a point where it can not ever be completed for lack of memory.

• Yes? In theory... Why?

• No? Why?

Page 144: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Cache Performance

[Figure 5 again: a memory model with a large local data cache separated from the main memory by a pathway with limited bandwidth µ.]

average cycles per word access = %hits × (cache cycles per word access) + (1 − %hits) × (main memory cycles per word access)   (4.3)

from Scott et al. “Scientific Parallel Computing” (2005)
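Plugging illustrative numbers (my own, not from the book) into (4.3): with a 90% hit rate, 1 cycle per word on a cache hit and 100 cycles per word from main memory, the average cost is 0.9 × 1 + 0.1 × 100 = 10.9 cycles per word access, so even a 10% miss rate lets main memory dominate the average.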

Page 145: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Cache Performance

[Slide 16, COMP 322, Fall 2009 (V. Sarkar): notation for the cache performance model - total execution time and cycle time; instruction counts (total, ALU, memory-access); average cycles per instruction (overall, ALU, memory); cache hit and miss rates; cycles per cache hit and per cache miss; and the ALU / memory instruction mix.]

from V. Sarkar (COMP 322, 2009)

Page 146: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Cache Performance: Example

[Slide 17, COMP 322, Fall 2009 (V. Sarkar): worked example using the cache performance model above.]

from V. Sarkar (COMP 322, 2009)

Page 147: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Parallel Complexity

Algorithmic Complexity Measures (COMP 322, Fall 2009, V. Sarkar)

TP = execution time on P processors

Computation graph abstraction (DAG):
• Node = arbitrary sequential computation
• Edge = dependence (a successor node can only execute after its predecessor node has completed)
• Directed acyclic graph (dag)

Processor abstraction:
• P identical processors (PROC0 ... PROCP-1)
• Each processor executes one node at a time

adapted from V. Sarkar (COMP 322, 2009)

Page 148: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Parallel Complexity

TP = execution time on P processors

T1 = work ("work complexity"): the total number of operations performed

adapted from V. Sarkar (COMP 322, 2009)

Page 149: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Parallel Complexity

TP = execution time on P processors

T1 = work ("work complexity")

T∞ = span* ("step complexity"): the minimum number of steps when arbitrarily many processors are available

* also called critical-path length or computational depth

adapted from V. Sarkar (COMP 322, 2009)

Page 150: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Parallel Complexity

TP = execution time on P processors

Lower bounds: TP ≥ T1 / P and TP ≥ T∞

adapted from V. Sarkar (COMP 322, 2009)

Page 151: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Parallel Complexity

TP = execution time on P processors, T1 = work, T∞ = span

Parallelism (i.e. ideal speed-up): T1 / T∞

Speedup: if T1/TP = Θ(P), we have linear speedup; = P, perfect linear speedup; > P, superlinear speedup.

Superlinear speedup is not possible in this model because of the lower bound TP ≥ T1/P, but superlinear speedup can be possible in practice (as we will see later in the course).

adapted from V. Sarkar (COMP 322, 2009)
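A one-line derivation the slides leave implicit: combining the speedup T1/TP with the span lower bound TP ≥ T∞ gives T1/TP ≤ T1/T∞, i.e. the speedup on any number of processors is bounded by the parallelism T1/T∞, which is why parallelism is called the ideal speed-up.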

Page 152: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Example: Array Sum (sequential version)

• Problem: compute the sum of the elements X[0] … X[n-1] of array X

• Sequential algorithm: sum = 0; for (i = 0; i < n; i++) sum += X[i];

• Computation graph: a chain of + nodes, 0 + X[0] + X[1] + X[2] + …

• Work = O(n), Span = O(n), Parallelism = O(1)

• How can we design an algorithm (computation graph) with more parallelism?

adapted from V. Sarkar (COMP 322, 2009)

Page 153: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Example: Array Sum (parallel iterative version)

adapted from V. Sarkar (COMP 322, 2009)

• Computation graph for n = 8: pairwise additions arranged as a tree (X[0]+X[1], X[2]+X[3], X[4]+X[5], X[6]+X[7]; then X[0]+X[2], X[4]+X[6]; then X[0]+X[4]), with extra dependence edges due to the forall construct

• Work = O(n), Span = O(log n), Parallelism = O( n / log n )

Page 154: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Example: Array Sum (parallel recursive version)

adapted from V. Sarkar (COMP 322, 2009)

• Computation graph for n = 8: a balanced binary tree of + nodes over X[0] … X[7]

• Work = O(n), Span = O(log n), Parallelism = O( n / log n )

• No extra dependences as in the forall case
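A minimal sequential sketch of the recursive decomposition (my own code; in the lecture's model the two recursive calls would be executed as independent parallel tasks). The recursion tree is exactly the balanced, O(log n)-depth computation graph above.

#include <stdio.h>

/* Pairwise-recursive array sum over X[lo..hi).
 * The two half-sums are independent, so a parallel runtime could run
 * them as separate tasks; the tree has depth O(log n), i.e. the span. */
static double array_sum(const double *X, int lo, int hi) {
    if (hi - lo == 1)
        return X[lo];
    int mid = lo + (hi - lo) / 2;
    double left  = array_sum(X, lo, mid);    /* independent of ...        */
    double right = array_sum(X, mid, hi);    /* ... this second half      */
    return left + right;                     /* one + node per tree level */
}

int main(void) {
    double X[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("sum = %g\n", array_sum(X, 0, 8));   /* prints 36 */
    return 0;
}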

Page 155: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Patterns

Page 156: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Task vs Data Parallelism

Page 157: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Task parallelism

• Distribute the tasks across processors based on dependency

• Coarse-grain parallelism

[Diagram: a task dependency graph of Tasks 1-9, and its assignment across three processors (P1, P2, P3) over time.]

Page 158: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Data parallelism

• Run a single kernel over many elements
  - Each element is independently updated
  - The same operation is applied to each element

• Fine-grain parallelism
  - Many lightweight threads, easy to switch context
  - Maps well to ALU-heavy architectures: GPUs

[Diagram: a stream of data elements, each processed by the same kernel on processors P1 … Pn.]
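A small sketch of the pattern in plain C (illustrative only; the function and names are made up): the loop body is the "kernel", and because no iteration depends on another, each element could be handed to its own lightweight GPU thread.

/* Data-parallel pattern: apply the same kernel to every element.
 * Each iteration is independent, so on a GPU the loop index would
 * simply become the thread (work-item) index. */
void scale_kernel(float *out, const float *in, float a, int n) {
    for (int i = 0; i < n; i++)   /* conceptually: i = thread index */
        out[i] = a * in[i];       /* same operation, different data */
}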

Page 159: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Task vs. Data parallelism

• Task parallel
  - Independent processes with little communication
  - Easy to use ("free" on modern operating systems with SMP)

• Data parallel
  - Lots of data on which the same computation is being executed
  - No dependencies between data elements in each step of the computation
  - Can saturate many ALUs
  - But often requires redesign of traditional algorithms

slide by Mike Houston

Page 160: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

CPU vs. GPU

• CPU
  - Really fast caches (great for data reuse)
  - Fine branching granularity
  - Lots of different processes/threads
  - High performance on a single thread of execution

• GPU
  - Lots of math units
  - Fast access to onboard memory
  - Run a program on each fragment/vertex
  - High throughput on parallel tasks

• CPUs are great for task parallelism

• GPUs are great for data parallelism

slide by Mike Houston

Page 161: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

GPU-friendly Problems

• Data-parallel processing

• High arithmetic intensity
  - Keep GPU busy all the time
  - Computation offsets memory latency

• Coherent data access
  - Access large chunk of contiguous memory
  - Exploit fast on-chip shared memory

Page 162: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

The Algorithm Matters

• Jacobi: parallelizable - every new value depends only on the previous iteration's array (v_old = iteration n, v_new = iteration n+1):

  for (int i = 1; i < num - 1; i++) { v_new[i] = (v_old[i-1] + v_old[i+1]) / 2.0; }

• Gauss-Seidel: difficult to parallelize - each update reads values already overwritten earlier in the same sweep:

  for (int i = 1; i < num - 1; i++) { v[i] = (v[i-1] + v[i+1]) / 2.0; }
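A self-contained version of the Jacobi sweep (my own scaffolding around the slide's one-liner; the array size and iteration count are arbitrary). Keeping two buffers makes explicit why every i can be updated in parallel: all reads come from the old array, all writes go to the new one.

#include <stdio.h>

#define NUM 8

/* One Jacobi sweep: reads only v_old, writes only v_new,
 * so the iterations of the i-loop are independent. */
static void jacobi_sweep(const double *v_old, double *v_new, int n) {
    for (int i = 1; i < n - 1; i++)
        v_new[i] = (v_old[i - 1] + v_old[i + 1]) / 2.0;
    v_new[0]     = v_old[0];       /* keep boundary values fixed */
    v_new[n - 1] = v_old[n - 1];
}

int main(void) {
    double a[NUM] = {0, 0, 0, 0, 0, 0, 0, 1};
    double b[NUM];
    for (int it = 0; it < 100; it++) {   /* ping-pong between the two buffers */
        jacobi_sweep(a, b, NUM);
        jacobi_sweep(b, a, NUM);
    }
    for (int i = 0; i < NUM; i++)
        printf("%.3f ", a[i]);
    printf("\n");
    return 0;
}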

Page 163: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Example: Reduction

• Serial version (O(N)):

  for (int i = 1; i < N; i++) { v[0] += v[i]; }

• Parallel version (O(log N) steps; assumes N is a power of two):

  width = N / 2;
  while (width >= 1) {
      for (int i = 0; i < width; i++) {
          v[i] += v[i + width];   // the iterations of this loop can run in parallel
      }
      width /= 2;
  }
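Wrapped into a runnable sketch (my own scaffolding around the slide's loop), with N chosen as a power of two so that repeated halving ends exactly at width = 1:

#include <stdio.h>

#define N 8   /* power of two */

int main(void) {
    double v[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* Tree reduction: log2(N) sweeps; within one sweep the i-loop
     * iterations touch disjoint pairs and could run in parallel. */
    for (int width = N / 2; width >= 1; width /= 2)
        for (int i = 0; i < width; i++)
            v[i] += v[i + width];

    printf("sum = %g\n", v[0]);   /* prints 36 */
    return 0;
}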

Page 164: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

The Importance of Data Parallelism for GPUs

• GPUs are designed for highly parallel tasks like rendering

• GPUs process independent vertices and fragments
  - Temporary registers are zeroed
  - No shared or static data
  - No read-modify-write buffers
  - In short, no communication between vertices or fragments

• Data-parallel processing
  - GPU architectures are ALU-heavy (multiple vertex & pixel pipelines, lots of compute power)
  - GPU memory systems are designed to stream data (linear access patterns can be prefetched, hiding memory latency)

slide by Mike Houston

Page 165: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

!"#$$%&$'()*+,-.(/$$0

1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86

/E#E/$$0

F

!"!# !"$#

$"!# $"$#

!%&'() $*(+%,()

!%&'()

$*(+%,()

#-+-

"&.+/*0+%1& !"!# !"$#

$"!# $"$#

!%&'() $*(+%,()

!%&'()

$*(+%,()

#-+-

"&.+/*0+%1&

!"!# !"$#

$"!# $"$#

!%&'() $*(+%,()

!%&'()

$*(+%,()

#-+-

"&.+/*0+%1& !"!# !"$#

$"!# $"$#

!%&'() $*(+%,()

!%&'()

$*(+%,()

#-+-

"&.+/*0+%1&

!"!# !"$#

$"!# $"$#

!%&'() $*(+%,()

!%&'()

$*(+%,()

"&.+/*0+%1&

"&.+/*0+%1&

$"$#

!"#$%&'(%)*$+

!(, -.(/

0123$1453%&'(%)*$+

(,, 67523%$2

8+4$1& 9$1&

slide by Matthew Bolitho

Page 166: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Flynn's Taxonomy

Early classification of parallel computing architectures given by M. Flynn (1972) using the number of instruction streams and data streams. Still used.

• Single Instruction Single Data (SISD): conventional sequential computer with one processor, single program and data storage.

• Multiple Instruction Single Data (MISD): used for fault tolerance (Space Shuttle) - from Wikipedia.

• Single Instruction Multiple Data (SIMD): each processing element uses the same instruction, applied synchronously in parallel to different data elements (Connection Machine, GPUs). If-then-else statements take two steps to execute.

• Multiple Instruction Multiple Data (MIMD): each processing element loads separate instructions and separate data elements; processors work asynchronously. Since 2006 the top ten supercomputers are of this type (w/o the 10K-node SGI Altix Columbia at NASA Ames).

Update: Single Program Multiple Data (SPMD): autonomous processors executing the same program but not in lockstep. The most common style of programming.

adapted from Berger & Klöckner (NYU 2010)

Page 167: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Finding Concurrency

Page 168: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

!"#$$%&$'()*+,-.(/$$0

1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86

/E#E/$$0

&

! !"#$%&'()'*$'+&',)($'',%$)-."#)$/.&0/1)

0#$."2'#'3&

! 45)$."$".&0"3)"-)$."6./#)&7/&)0()$/./11'1

! 85)($'',%$)"-)$/./11'1)$".&0"3)

! 9)('.0/1)/16".0&7#)+/3):')#/,')$/./11'1):;)

!"#$"#%&'!"#$%&'()$!*+%,+-..!,+/0

! <03,)-%3,/#'3&/1)$/.&()"-)&7')/16".0&7#)

&7/&)/.')('$/./:1'

slide by Matthew Bolitho

Pages 169-171: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

[Diagram: the Finding Concurrency design space - Decomposition (Task Decomposition, Data Decomposition) and Dependency Analysis (Group Tasks, Order Tasks, Data Sharing).]

• Algorithms can be decomposed by both task and data

• Task: find groups of instructions that can be executed in parallel

• Data: find partitions in the data that can be used (relatively) independently

• Analyze the algorithm and find groups of instructions that are (relatively) independent

• Eg: Matrix Multiplication - computing each element of C is a dot product

slide by Matthew Bolitho; see Mattson et al., "Patterns for Parallel Programming" (2004)

Pages 172-175: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

• Analyze the algorithm and find groups of instructions that are (relatively) independent

• Eg: Molecular Dynamics
  - ComputeVibrationalForces
  - ComputeRotationalForces
  - ComputeDihedralForces
  - ComputeNeighbours
  - ComputeNonBondingForces
  - UpdatePositionsAndVelocities

• Analyze the algorithm to find ways to partition the data

• Eg: Matrix Multiplication: columns and rows; or blocks

[Diagram: the two input matrices A and B partitioned by columns/rows and by blocks.]

• Find ways to partition the data

slide by Matthew Bolitho; see Mattson et al., "Patterns for Parallel Programming" (2004)

Pages 176-179: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

• There are many ways to decompose any given algorithm
  - Sometimes the data decompose easily
  - Sometimes the tasks decompose easily
  - Sometimes both! Sometimes neither!

• Once the algorithm has been decomposed into data and tasks: analyze interactions

• To ease the management of dependencies, find tasks that are similar and group them

• Then analyze constraints to determine any necessary order

slide by Matthew Bolitho; see Mattson et al., "Patterns for Parallel Programming" (2004)

Pages 180-183: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

• To ease the management of dependencies, find tasks that are similar and group them

• Eg: Molecular Dynamics
  - Compute Bonded Forces: ComputeVibrationalForces, ComputeRotationalForces, ComputeDihedralForces
  - Compute Neighbours; Compute Non-Bonding Forces
  - UpdatePositionsAndVelocities

• Once groups of tasks are identified, data flow constraints enforce a partial order

[Diagram: task groups for Molecular Dynamics - Bonded Forces, Neighbor List, Non-Bonded Forces, Update Positions and Velocities - ordered by their data-flow dependencies.]

slide by Matthew Bolitho; see Mattson et al., "Patterns for Parallel Programming" (2004)

Pages 184-188: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

• Once (partially ordered) groups of tasks and partitions of data are identified, analyze the data sharing that occurs

• Data sharing can be categorized as:
  - Read-only
  - Effectively Local
  - Read-Write
  - Accumulate
  - Multiple Read/Single Write

Read-only
  - Data is read, but not written
  - No consistency problems
  - Replication in distributed systems

Effectively Local
  - Data is read and written
  - Data is partitioned into subsets, one task per subset
  - Can distribute subsets

Read-Write
  - Data is read and written
  - Many tasks access many data
  - Consistency issues; most difficult to deal with

Read-Write (Accumulation)
  - As per Read-Write, although writes consist of an accumulation operation
  - Common in reduction-type algorithms
  - Can replicate, since accumulation can be linear

slide by Matthew Bolitho; see Mattson et al., "Patterns for Parallel Programming" (2004)

Page 189: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns


Multiple Read/Single Write

• As per Read-Write, although only one task writes
• Relaxed consistency constraints

• Example: Matrix Multiplication
[Figure labels: Read-Only, Read-Only, Effectively Local]

• Example: Molecular Dynamics
[Figure labels: Atomic Coordinates, Update Positions and Velocities, Non Bonded Forces, Neighbor List, Bonded Forces, Forces]

slide by Matthew Bolitho; see Mattson et al., “Patterns for Parallel Programming” (2004)

Page 192: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Useful patterns (for reference)

Page 193: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Embarrassingly Parallel

yi = fi(xi), where i ∈ {1, . . . , N}.

Notation: (also for rest of this lecture)

• xi : inputs

• yi : outputs

• fi : (pure) functions (i.e. no side effects)

When does a function have a “side effect”?

In addition to producing a value, it

• modifies non-local state, or

• has an observable interaction with the

outside world.

Often: f1 = · · · = fN . Then

• Lisp/Python function map

• C++ STL std::transform

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)
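To make the pattern concrete in the language of this course, here is a minimal CUDA sketch (not from the slides; the function f and all names are placeholders): each thread applies a pure per-element function to its own input, the GPU analogue of map / std::transform.

#include <cuda_runtime.h>

// y[i] = f(x[i]) with a pure per-element function: the CUDA analogue
// of map / std::transform. Here f is only a placeholder (x -> 2x + 1).
__device__ float f(float x) { return 2.0f * x + 1.0f; }

__global__ void map_kernel(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = f(x[i]);       // no communication, no shared state
}

// Launch sketch: map_kernel<<<(n + 255) / 256, 256>>>(d_x, d_y, n);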

Page 196: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Embarrassingly Parallel: Graph Representation

[Graph: nine independent tasks; each fi maps xi to yi (i = 0 … 8) with no edges between them]

Trivial? Often: no.

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 197: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Embarrassingly Parallel: Examples

Surprisingly useful:

• Element-wise linear algebra: Addition, scalar multiplication (not inner product)

• Image Processing: Shift, rotate, clip, scale, . . .

• Monte Carlo simulation

• (Brute-force) Optimization

• Random Number Generation

• Encryption, Compression (after blocking)

• Software compilation
  • make -j8

But: Still needs a minimum of coordination. How can that be achieved?

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 198: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Mother-Child Parallelism

[Figure: a Mother process sends initial data to Children 0–4 and collects their results]

(formerly called “Master-Slave”)

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 199: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Embarrassingly Parallel: Issues

• Process Creation: Dynamic/Static?
• MPI 2 supports dynamic process creation

• Job Assignment (‘Scheduling’): Dynamic/Static?
• Operations/data light- or heavy-weight?
• Variable-size data?

• Load Balancing:
• Here: easy

Can you think of a load balancing recipe?

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 200: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Partition

yi = fi(xi−1, xi , xi+1)

where i ∈ {1, . . . ,N}.

Includes straightforward generalizations to dependencies on a larger (but not O(P)-sized!) set of neighbor inputs.

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 201: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Partition: Graph

x0 x1 x2 x3 x4 x5 x6

y1 y2 y3 y4 y5

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 202: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Partition: Examples

• Time-marching(in particular: PDE solvers)

• (Including finite differences → HW3!)

• Iterative Methods
  • Solve Ax = b (Jacobi, . . . )
  • Optimization (all P on single problem)
  • Eigenvalue solvers

• Cellular Automata (Game of Life :-)

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)
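A small CUDA sketch of the pattern (illustrative only; the 3-point average stands in for f): each output depends on its own input and its two neighbours, so a partition of the index range shares only a one-element halo at each boundary.

#include <cuda_runtime.h>

// y[i] = f(x[i-1], x[i], x[i+1]); here f is a placeholder 3-point average.
// Only neighbouring elements are read, so each partition of the index
// range touches just a small halo of shared data at its edges.
__global__ void stencil3(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 1 && i < n - 1)
        y[i] = (x[i - 1] + x[i] + x[i + 1]) / 3.0f;
}

// Launch sketch: stencil3<<<(n + 255) / 256, 256>>>(d_x, d_y, n);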


Page 203: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Partition: Issues

• Only useful when the computation

is mainly local

• Responsibility for updating one

datum rests with one processor

• Synchronization, Deadlock,

Livelock, . . .

• Performance Impact

• Granularity

• Load Balancing: Thorny issue

• → next lecture

• Regularity of the Partition?

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 204: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Pipelined Computation

y = fN(· · · f2(f1(x)) · · · ) = (fN ◦ · · · ◦ f1)(x)

where N is fixed.

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 205: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Pipelined Computation: Graph

[Graph: pipeline x → f1 → f2 → · · · → f6 → y]

Processor Assignment?

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 206: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Pipelined Computation: Examples

• Image processing

• Any multi-stage algorithm

• Pre/post-processing or I/O

• Out-of-Core algorithms

Specific simple examples:

• Sorting (insertion sort)

• Triangular linear system solve

(‘backsubstitution’)

• Key: Pass on values as soon as

they’re available

(will see more efficient algorithms for

both later)

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)
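On a GPU, one common way to realize a pipeline is to overlap the copy-in, compute and copy-out stages across chunks using CUDA streams. A rough sketch under stated assumptions: h_in and h_out are pinned host buffers, and stage_kernel is a placeholder per-chunk stage.

#include <cuda_runtime.h>

// Placeholder per-chunk stage; in a real pipeline this would be the
// interesting computation.
__global__ void stage_kernel(float* buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

// Process num_chunks chunks of `chunk` floats, overlapping copy-in,
// compute and copy-out across a few streams (h_in/h_out must be pinned
// for the async copies to actually overlap with the kernels).
void pipeline(const float* h_in, float* h_out, int num_chunks, int chunk)
{
    const int kStreams = 4;
    cudaStream_t stream[kStreams];
    float* d_buf[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_buf[s], chunk * sizeof(float));
    }
    for (int c = 0; c < num_chunks; ++c) {
        int s = c % kStreams;   // work in a stream is serialized, so d_buf[s] is reused safely
        cudaMemcpyAsync(d_buf[s], h_in + (size_t)c * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, stream[s]);
        stage_kernel<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunk);
        cudaMemcpyAsync(h_out + (size_t)c * chunk, d_buf[s],
                        chunk * sizeof(float), cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
    }
}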

Page 207: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Pipelined Computation: Issues

• Non-optimal while pipeline fills or

empties

• Often communication-inefficient

• for large data

• Needs some attention to

synchronization, deadlock

avoidance

• Can accommodate some

asynchrony

But don’t want:

• Pile-up

• Starvation

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 208: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Reduction

y = f (· · · f (f (x1, x2), x3), . . . , xN)

where N is the input size.

Also known as. . .

• Lisp/Python function reduce (Scheme: fold)

• C++ STL std::accumulate

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 209: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Reduction: Graph

[Graph: left-leaning chain f (· · · f (f (x1, x2), x3) · · · , x6) → y]

Painful! Not parallelizable.

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 210: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Approach to Reduction

f (x, y)?

Can we do better?

“Tree” very imbalanced. What property of f would allow ‘rebalancing’?

f (f (x, y), z) = f (x, f (y, z))

Looks less improbable if we let x ◦ y = f (x, y):

x ◦ (y ◦ z) = (x ◦ y) ◦ z

Has a very familiar name: Associativity

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)
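Spelled out for N = 8 in the same notation, associativity lets the left-leaning chain be regrouped into a balanced tree:

x1 ◦ x2 ◦ · · · ◦ x8 = ((x1 ◦ x2) ◦ (x3 ◦ x4)) ◦ ((x5 ◦ x6) ◦ (x7 ◦ x8))

The seven applications of ◦ are unchanged, but they can now be evaluated in three parallel rounds instead of seven sequential steps.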

Page 211: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Reduction: A Better Graph

y

x0 x1 x2 x3 x4 x5 x6 x7

Processor allocation?

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 212: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Mapping Reduction to the GPU

• Obvious: Want to use tree-based approach.

• Problem: Two scales, Work group and Grid

• Need to occupy both to make good use of the machine.

• In particular, need synchronization after each tree stage.

• Solution: Use a two-scale algorithm.


Solution: Kernel Decomposition

Avoid global sync by decomposing computation into multiple kernel invocations

In the case of reductions, code for all levels is the same

Recursive kernel invocation

[Figure: eight blocks, each reducing the values 3 1 7 0 4 1 6 3 through partial sums 4 7 5 9 and 11 14 down to 25]

Level 0: 8 blocks

Level 1: 1 block

In particular: Use multiple grid invocations to achieve

inter-workgroup synchronization. With material by M. Harris

(Nvidia Corp.)

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)
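A hedged host-side sketch of the two-scale idea (names are mine): each launch reduces the current array to one partial sum per block, and relaunching until a single value remains provides the inter-workgroup synchronization at the kernel boundary. block_reduce is assumed to be a block-wise kernel of the shape sketched after the Sequential Addressing slide below.

#include <cuda_runtime.h>

// Assumed kernel: one block of `threads` threads reduces `threads`
// consecutive inputs to a single partial sum (sketched further below).
__global__ void block_reduce(const float* in, float* out, int n);

// Multi-pass ("recursive") reduction: each launch turns n values into
// ceil(n / threads) block-level partial sums; the kernel boundary is the
// global synchronization between tree levels. Overwrites both buffers.
void reduce_on_gpu(float* d_in, float* d_tmp, int n)
{
    const int threads = 256;
    float *in = d_in, *out = d_tmp;
    while (n > 1) {
        int blocks = (n + threads - 1) / threads;
        block_reduce<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
        n = blocks;                          // partial sums are the next input
        float* t = in; in = out; out = t;    // ping-pong buffers
    }
    // result: first element of the buffer written last (now pointed to by `in`)
}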

Page 213: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Interleaved Addressing


Parallel Reduction: Interleaved Addressing

[Figure: 16 values in shared memory (10 1 8 −1 0 −2 3 5 −2 −3 2 7 0 11 0 2) reduced in four steps with strides 1, 2, 4, 8; the active thread IDs are 0 2 4 … 14, then 0 4 8 12, then 0 8, then 0; the first element ends up holding the total, 41]

Issue: Slow modulo, Divergence

With material by M. Harris

(Nvidia Corp.)

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)
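For reference, an interleaved-addressing block reduction looks roughly like this (a sketch in the spirit of the slide, not its exact code; assumes a power-of-two block size). The modulo test is what makes neighbouring threads in a warp take different branches, and the modulo itself is slow:

#include <cuda_runtime.h>

// Interleaved addressing: in each step the active threads are those whose
// index is a multiple of 2*stride, so threads of one warp diverge.
__global__ void block_reduce_interleaved(const float* in, float* out, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0)            // divergent branch, slow modulo
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}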

Page 214: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Sequential Addressing


Parallel Reduction: Sequential Addressing

[Figure: the same 16 values reduced in four steps with strides 8, 4, 2, 1; the active thread IDs are 0 … 7, then 0 … 3, then 0 1, then 0; the first element ends up holding the total, 41]

Sequential addressing is conflict free. Better! But still not “efficient”.

Only half of all work items after first round,

then a quarter, . . . With material by M. Harris

(Nvidia Corp.)

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)
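A self-contained sketch of a block-wise reduction with sequential addressing, in the spirit of these slides rather than their exact code (assumptions: power-of-two block size, out-of-range inputs padded with the neutral element 0):

#include <cuda_runtime.h>

// Each block reduces blockDim.x consecutive elements to one partial sum.
// Sequential addressing: in every step the active threads form a contiguous
// prefix 0 .. stride-1, so active warps do not diverge and shared-memory
// accesses are conflict free. blockDim.x must be a power of two.
__global__ void block_reduce(const float* in, float* out, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;     // pad with the neutral element
    __syncthreads();

    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];          // one partial sum per block
}

// Launch sketch: block_reduce<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);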

Page 215: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Reduction: Examples

• Sum, Inner Product, Norm

• Occurs in iterative methods

• Minimum, Maximum

• Data Analysis

• Evaluation of Monte Carlo

Simulations

• List Concatenation, Set Union

• Matrix-Vector product (but. . . )

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 216: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Reduction: Issues

• When adding: floating point

cancellation?

• Serial order goes faster:

can use registers for intermediate

results

• Requires availability of neutral

element

• GPU-Reduce: Optimization

sensitive to data type

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 217: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Map-Reduce

y = f (· · · f (f (g(x1), g(x2)), g(x3)), . . . , g(xN))

where N is the input size.

• Lisp naming, again

• Mild generalization of reduction

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)
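One convenient way to express this map + reduce composition on the GPU is Thrust's transform_reduce, which ships with the CUDA toolkit. A minimal example (my illustration, not from the lecture): here g is squaring and f is addition, so the call computes a sum of squares without ever materializing the mapped values.

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cstdio>

// map + reduce in one call: g = square (the "map"), f = plus (the "reduce").
int main()
{
    thrust::device_vector<float> x(1000, 2.0f);    // placeholder input
    float result = thrust::transform_reduce(x.begin(), x.end(),
                                            thrust::square<float>(),  // g
                                            0.0f,                     // neutral element
                                            thrust::plus<float>());   // f
    printf("sum of squares = %f\n", result);       // 1000 * 4 = 4000
    return 0;
}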

Page 218: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Map-Reduce: Graph

[Graph: inputs x0 … x7 each pass through g, then are combined pairwise by f down to a single output y]

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 219: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

MapReduce: Discussion

MapReduce ≥ map + reduce:

• Used by Google (and many others) forlarge-scale data processing

• Map generates (key, value) pairs
• Reduce operates only on pairs with identical keys
• Remaining output sorted by key

• Represent all data as character strings
  • User must convert to/from internal repr.

• Messy implementation
  • Parallelization, fault tolerance, monitoring, data management, load balance, re-run “stragglers”, data locality

• Works for Internet-size data

• Simple to use even for inexperienced users

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 220: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

MapReduce: Examples

• String search

• (e.g. URL) Hit count from Log

• Reverse web-link graph

• desired: (target URL, sources)

• Sort

• Indexing

• desired: (word, document IDs)

• Machine Learning, Clustering, . . .

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 221: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Scan

y1 = x1
y2 = f (y1, x2)
...
yN = f (yN−1, xN)

where N is the input size.

• Also called “prefix sum”.

• Or cumulative sum (‘cumsum’) by Matlab/NumPy.

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 222: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Scan: Graph

[Graph: serial scan chain; y0 = x0, and each yi combines y(i−1) with xi through a chain of Id/f nodes]

This can’t possibly be parallelized.

Or can it?

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 223: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Scan: Graph

[Graph: serial scan chain; y0 = x0, and each yi combines y(i−1) with xi through a chain of Id/f nodes]

This can’t possibly be parallelized.

Or can it?

Again: Need assumptions on f. Associativity, commutativity.

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 224: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Scan: Implementation

Work-efficient?

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)
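For comparison, a minimal single-block inclusive scan in CUDA (my sketch, not the slide's code): the Hillis-Steele form, which answers the question above in the negative, since it performs O(N log N) additions instead of the O(N) of a serial scan.

#include <cuda_runtime.h>

// Inclusive scan of one block's worth of data (Hillis-Steele form).
// In step k, each element adds the value `offset` positions to its left;
// after log2(blockDim.x) steps, out[i] = in[0] + ... + in[i].
// Not work-efficient: O(N log N) additions versus O(N) serially.
__global__ void scan_block(const float* in, float* out, int n)
{
    extern __shared__ float temp[];
    int tid = threadIdx.x;

    temp[tid] = (tid < n) ? in[tid] : 0.0f;   // 0 is neutral for +
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        float addend = (tid >= offset) ? temp[tid - offset] : 0.0f;
        __syncthreads();                      // all reads before any write
        temp[tid] += addend;
        __syncthreads();                      // all writes before next reads
    }
    if (tid < n)
        out[tid] = temp[tid];
}

// Launch sketch (one block): scan_block<<<1, 256, 256 * sizeof(float)>>>(d_x, d_y, n);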

Page 225: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Scan: Implementation II

Two sweeps: Upward, downward,

both tree-shape

On upward sweep:

• Get values L and R from left and right

child

• Save L in local variable Mine

• Compute Tmp = L + R and pass to parent

On downward sweep:

• Get value Tmp from parent

• Send Tmp to left child

• Send Tmp + Mine to right child

Work-efficient?

Span rel. to first attempt?

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 227: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Scan: Examples

• Anything with a loop-carried

dependence

• One row of Gauss-Seidel

• One row of triangular solve

• Segment numbering if boundaries

are known

• Low-level building block for many

higher-level algorithms

• FIR/IIR Filtering

• G.E. Blelloch:

Prefix Sums and their Applications

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 228: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Scan: Issues

• Subtlety: Inclusive/Exclusive Scan

• Pattern sometimes hard to recognize
  • But shows up surprisingly often

• Need to prove associativity/commutativity

• Useful in Implementation: algorithm cascading
  • Do sequential scan on parts, then parallelize at coarser granularities

Embarrassing · Partition · Pipelines · Reduction · Scan (slide from Berger & Klöckner, NYU 2010)

Page 229: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Divide and Conquer

yi = fi(x1, . . . , xN) for i ∈ {1, . . . , M}.

Main purpose: A way of partitioning up fully dependent tasks.

[Graph: inputs x0 … x7 are recursively split into halves, processed, and merged back through intermediate levels u, v, w into outputs y0 … y7]

Processor allocation?

D&C · General (slide from Berger & Klöckner, NYU 2010)

Page 230: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Divide and Conquer: Examples

• GEMM, TRMM, TRSM, GETRF

(LU)

• FFT

• Sorting: Bucket sort, Merge sort

• N-Body problems (Barnes-Hut,

FMM)

• Adaptive Integration

More fun with work and span:

D&C analysis lecture

D&C · General (slide from Berger & Klöckner, NYU 2010)

Page 231: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

Divide and Conquer: Issues

• “No idea how to parallelize that”
  • → Try D&C

• Non-optimal during partition, merge
  • But: Does not matter if deep levels do heavy enough processing

• Subtle to map to fixed-width machines (e.g. GPUs)
  • Varying data size along tree
  • Bookkeeping nontrivial for non-2^n sizes

• Side benefit: D&C is generally cache-friendly

D&C · General (slide from Berger & Klöckner, NYU 2010)

Page 232: [Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns

COME