Lecture #2: Architecture, Theory & Patterns | February 1st, 2011
Nicolas Pinto (MIT, Harvard) [email protected]
Massively Parallel Computing (CS 264 / CSCI E-292)
Objectives
• introduce important computational thinking skills for massively parallel computing
• understand hardware limitations
• understand algorithm constraints
• identify common patterns
During this course, we'll try to ... and use existing material ;-)
“adapted for CS264”
Outline
• Thinking Parallel
• Architecture
• Programming Model
• Bits of Theory
• Patterns
Parallel Computing

“Parallel computing is a form of computing in which many instructions are carried out simultaneously.” (Wikipedia)

• Traditionally: large, expensive, specialized
  • Exotic supercomputers (e.g. Cray)
  • Distributed systems (e.g. ASCI White, BlueGene)
• Parallel computing was traditionally inaccessible to the commodity marketplace

“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year ... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.”
(Gordon Moore, Electronics Magazine, 19 April 1965)

• The most economic number of components in an IC will double every year
• Historically, CPUs get faster
  • Hardware reaching frequency limitations
• Now CPUs get wider
  • Rather than expecting CPUs to get twice as fast, expect to have twice as many!
• Parallel processing for the masses
• Unfortunately: parallel programming is hard!
  • Algorithms and data structures must be fundamentally redesigned
slide by Matthew Bolitho
Motivation
Thinking Parallel
Getting your feet wet
• Common scenario: “I want to make the algorithm X run faster, help me!”
• Q: How do you approach the problem?
How?
How?
• Option 1: wait
• Option 2: gcc -O3 -msse4.2
• Option 3: xlc -O5
• Option 4: use parallel libraries (e.g. (cu)blas)
• Option 5: hand-optimize everything!
• Option 6: wait more
What else ?
How about analysis ?
Getting your feet wet
[Bar chart, time (s) per function: load_data() 50, foo() 11, bar() 10, yey() 29; total ≈ 100 s.]
Algorithm X v1.0 Profiling Analysis on Input 10x10x10
sequential in nature
100% parallelizable
Q: What is the maximum speed up ?
A: 2X ! :-(
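The 2X bound is just Amdahl's law. A minimal sketch of the arithmetic in C (assuming the profile above: 50 s sequential, 50 s perfectly parallelizable across N processors):

/* Amdahl's law: speedup = 1 / ((1 - p) + p / N),
   where p is the parallelizable fraction of the runtime.
   Numbers below are taken from the profiling chart above. */
#include <stdio.h>

int main(void)
{
    double t_seq = 50.0, t_par = 50.0;       /* seconds, from the profile */
    double p = t_par / (t_seq + t_par);      /* parallelizable fraction   */

    for (int n = 1; n <= 1024; n *= 4)
        printf("N = %4d  speedup = %.2fx\n", n, 1.0 / ((1.0 - p) + p / n));
    /* as N grows, speedup approaches 1 / (1 - p) = 2x */
    return 0;
}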
Getting your feet wet
[Bar chart, time (s) per function for load_data(), foo(), bar(), yey(): one function now dominates at 9,000 s while the other three take roughly 250-350 s each.]
Algorithm X v1.0 Profiling Analysis on Input 100x100x100
sequential in nature
100% parallelizable
Q: and now?
You need to...
• ... understand the problem (duh!)
• ... study the current (sequential?) solutions and their constraints
• ... know the input domain
• ... profile accordingly
• ... “refactor” based on new constraints (hw/sw)
A better way ?
Speculation: (input) domain-aware optimization using some sort of probabilistic modeling ?
...
doesn’t scale !
Some Perspective
The “problem tree” for scientific problem solving: there are many options to try to achieve the same goal.
[Figure: a technical problem to be analyzed can be approached through consultation with experts, experiments, or theoretical analysis, leading to scientific model "A" or "B", discretization "A" or "B", a direct-elimination or an iterative equation solver, and finally a sequential or a parallel implementation.]
from Scott et al. “Scientific Parallel Computing” (2005)
Computational Thinking
• translate/formulate domain problems into computational models that can be solved efficiently by available computing resources
• requires a deep understanding of their relationships
adapted from Hwu & Kirk (PASI 2011)
[Figure: Architecture, Algorithms, and Languages surrounding Parallel Computing, which supports APPLICATIONS. Knowledge of algorithms, architecture, and languages contributes to effective use of parallel computers in practical applications.]
adapted from Scott et al. “Scientific Parallel Computing” (2005)
Getting ready...
Compilers, Patterns
Programming Models
Parallel Thinking
Fundamental Skills
• Computer architecture
• Programming models and compilers
• Algorithm techniques and patterns
• Domain knowledge
Computer Architecture
• memory organization, bandwidth and latency; caching and locality (memory hierarchy)
• floating-point precision vs. accuracy
• SISD, SIMD, MISD, MIMD vs. SIMT, SPMD
critical for understanding tradeoffs between algorithms
Programming models
• parallel execution models (threading hierarchy)
• optimal memory access patterns
• array data layout and loop transformations
for optimal data structure and code execution
Algorithms and patterns
• toolbox for designing good parallel algorithms
• it is critical to understand their scalability and efficiency
• many have been exposed and documented
• sometimes hard to “extract”
• ... but keep trying!
Domain Knowledge
• abstract modeling
• mathematical properties
• accuracy requirements
• coming back to the drawing board to expose more/better parallelism ?
You can do it!
• thinking parallel is not as hard as you may think
• many techniques have been thoroughly explained...
• ... and are now “accessible” to non-experts !
Architecture
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
What’s in a computer?
Processor
Intel Q6600 Core2 Quad, 2.4 GHz
Die
(2×) 143 mm2, 2× 2 cores
582,000,000 transistors
∼ 100W
Memory
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
A Basic Processor
[Diagram: an internal bus connects the register file, flags, data ALU, address ALU, control unit (with PC), and memory interface (instruction fetch); the memory interface drives the external data bus and address bus.]
(loosely based on Intel 8086)
Bonus question: What's a bus?
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
How all of this fits together
Everything synchronizes to the Clock.
Control Unit (“CU”): the brains of the operation. Everything connects to it.
Bus entries/exits are gated and (potentially) buffered.
The CU controls the gates and tells the other units ‘what’ and ‘how’:
• What operation?
• Which register?
• Which addressing mode?
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
What is. . . an ALU?
Arithmetic Logic Unit
One or two operands A, B
Operation selector (Op):
• (Integer) Addition, Subtraction
• (Logical) And, Or, Not
• (Bitwise) Shifts (equivalent to
multiplication by power of two)
• (Integer) Multiplication, Division
Specialized ALUs:
• Floating Point Unit (FPU)
• Address ALU
Operates on binary representations of
numbers. Negative numbers represented
by two’s complement.
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
What is. . . a Register File?
Registers are On-Chip Memory
• Directly usable as operands in
Machine Language
• Often “general-purpose”
• Sometimes special-purpose: Floating
point, Indexing, Accumulator
• Small: x86-64 has 16 × 64-bit GPRs
• Very fast (near-zero latency)
%r0
%r1
%r2
%r3
%r4
%r5
%r6
%r7
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
How does computer memory work?
One (reading) memory transaction (simplified):
Processor Memory
CLK
R/W
A0..15
D0..15
Observation: Access (and addressing) happens
in bus-width-size “chunks”.
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
What is. . . a Memory Interface?
Memory Interface gets and stores binary words in off-chip memory.
Smallest granularity: Bus width
Tells outside memory
• “where” through address bus
• “what” through data bus
Computer main memory is “Dynamic RAM” (DRAM): slow, but big and cheap.
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
A Very Simple Program
int a = 5;
int b = 17;
int z = a * b;
 4: c7 45 f4 05 00 00 00    movl $0x5,-0xc(%rbp)
 b: c7 45 f8 11 00 00 00    movl $0x11,-0x8(%rbp)
12: 8b 45 f4                mov  -0xc(%rbp),%eax
15: 0f af 45 f8             imul -0x8(%rbp),%eax
19: 89 45 fc                mov  %eax,-0x4(%rbp)
1c: 8b 45 fc                mov  -0x4(%rbp),%eax
Things to know:
• Addressing modes (Immediate, Register, Base plus Offset)
• Hexadecimal notation (0x...)
• “AT&T form” (we'll use this): <opcode><size> <source>, <dest>
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
A Very Simple Program: Intel Form
4: c7 45 f4 05 00 00 00 mov DWORD PTR [rbp−0xc],0x5
b: c7 45 f8 11 00 00 00 mov DWORD PTR [rbp−0x8],0x11
12: 8b 45 f4 mov eax,DWORD PTR [rbp−0xc]
15: 0f af 45 f8 imul eax,DWORD PTR [rbp−0x8]
19: 89 45 fc mov DWORD PTR [rbp−0x4],eax
1c: 8b 45 fc mov eax,DWORD PTR [rbp−0x4]
• “Intel form” (you might see this on the net): <opcode> <sized dest>, <sized source>
• Goal: Reading comprehension.
• Don’t understand an opcode?Google “<opcode> intel instruction”.
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Machine Language Loops
int main()
{
  int y = 0, i;
  for (i = 0; y < 10; ++i)
    y += i;
  return y;
}
 0: 55                      push %rbp
 1: 48 89 e5                mov  %rsp,%rbp
 4: c7 45 f8 00 00 00 00    movl $0x0,-0x8(%rbp)
 b: c7 45 fc 00 00 00 00    movl $0x0,-0x4(%rbp)
12: eb 0a                   jmp  1e <main+0x1e>
14: 8b 45 fc                mov  -0x4(%rbp),%eax
17: 01 45 f8                add  %eax,-0x8(%rbp)
1a: 83 45 fc 01             addl $0x1,-0x4(%rbp)
1e: 83 7d f8 09             cmpl $0x9,-0x8(%rbp)
22: 7e f0                   jle  14 <main+0x14>
24: 8b 45 f8                mov  -0x8(%rbp),%eax
27: c9                      leaveq
28: c3                      retq
Things to know:
• Condition Codes (Flags): Zero, Sign, Carry, etc.
• Call Stack: Stack frame, stack pointer, base pointer
• ABI: Calling conventions
Want to make those yourself? Write myprogram.c.
$ cc -c myprogram.c
$ objdump --disassemble myprogram.o
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
We know how a computer works!
All of this can be built in about 4000 transistors.
(e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600)
So what exactly is Intel doing with the other 581,996,000
transistors?
Answer:
Make things go faster!
Goal now: Understand sources of slowness, and how they get addressed.
Remember: High Performance Computing
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
The High-Performance Mindset
Writing high-performance codes
Mindset: What is going to be the limiting factor?
• ALU?
• Memory?
• Communication? (if multi-machine)
Benchmark the assumed limiting factor right away.
Evaluate
• Know your peak throughputs (roughly)
• Are you getting close?
• Are you tracking the right limiting factor?
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
Source of Slowness: Memory
Memory is slow.
Distinguish two different versions of “slow”:
• Bandwidth
• Latency
→ Memory has long latency, but can have large bandwidth.
Size of die vs. distance to memory: big!
Dynamic RAM: long intrinsic latency!
Idea:
Put a look-up table of recently-used data onto the chip.
→ “Cache”
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
The Memory Hierarchy
Hierarchy of increasingly bigger, slower memories:
Registers: 1 kB, 1 cycle
L1 Cache: 10 kB, 10 cycles
L2 Cache: 1 MB, 100 cycles
DRAM: 1 GB, 1000 cycles
Virtual Memory (hard drive): 1 TB, 1 M cycles
How might data locality factor into this?
What is a working set?
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
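As a concrete illustration of why data locality matters in this hierarchy (a minimal sketch; the matrix size is an arbitrary assumption), the two loops below touch exactly the same data, but the first walks memory contiguously while the second strides by a whole row per access:

/* Row-major vs. column-major traversal of the same matrix.
   The contiguous traversal reuses each cache line fully;
   the strided one touches a new line on almost every access. */
#include <stdio.h>

#define N 4096                       /* 4096 x 4096 doubles = 128 MiB */

int main(void)
{
    static double a[N][N];
    double sum = 0.0;

    for (int i = 0; i < N; ++i)      /* row-major: contiguous accesses */
        for (int j = 0; j < N; ++j)
            sum += a[i][j];

    for (int j = 0; j < N; ++j)      /* column-major: stride of N doubles */
        for (int i = 0; i < N; ++i)
            sum += a[i][j];

    printf("%f\n", sum);             /* keep the loops from being optimized away */
    return 0;
}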
Impact on Performance
[Figure: hypothetical model of performance of a computer having a hierarchy of memory systems (registers, cache, main memory, and disk). Performance steps down as the problem grows from fitting within registers, to fitting within cache, to fitting within main memory, to requiring secondary (disk) memory, until the problem is too big for the system.]
from Scott et al. “Scientific Parallel Computing” (2005)
Cache: Actual Implementation
Demands on cache implementation:
• Fast, small, cheap, low-power
• Fine-grained
• High “hit”-rate (few “misses”)
Problem: Goals at odds with each other: access matching logic is expensive!
Solution 1: More data per unit of access matching logic → larger “Cache Lines”
Solution 2: Simpler/less access matching logic → less than full “Associativity”
Other choices: Eviction strategy, size
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Cache: Associativity
Direct Mapped vs. 2-way set associative
[Diagram: in a direct-mapped cache, each memory block maps to exactly one cache line; in a 2-way set-associative cache, each memory block may be placed in either of the two lines of its set.]
[Chart: miss rate versus cache size on the integer portion of SPEC CPU2000 (Cantin, Hill 2003).]
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
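To make the address-to-set mapping concrete, here is a small sketch with hypothetical parameters (64-byte lines and 64 sets, roughly matching the L1 data cache on the next slide): it splits an address into offset, set index, and tag, which is what the access-matching logic compares.

/* Minimal sketch (assumed parameters, made-up address):
   where does an address land in a set-associative cache? */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64u    /* bytes per cache line (assumption) */
#define NUM_SETS  64u    /* number of sets (assumption)       */

int main(void)
{
    uintptr_t addr = 0x7ffe12345678u;                 /* made-up address */

    unsigned  offset = addr % LINE_SIZE;              /* byte within the line  */
    unsigned  set    = (addr / LINE_SIZE) % NUM_SETS; /* which set it maps to  */
    uintptr_t tag    = addr / (LINE_SIZE * NUM_SETS); /* identifies the line   */

    /* Addresses with the same set index but different tags compete
       for the ways of that set. */
    printf("offset=%u set=%u tag=%#lx\n", offset, set, (unsigned long) tag);
    return 0;
}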
Cache Example: Intel Q6600/Core2 Quad
--- L1 data cache ---fully associative cache = falsethreads sharing this cache = 0x0 (0)processor cores on this die= 0x3 (3)system coherency line size = 0x3f (63)ways of associativity = 0x7 (7)number of sets - 1 (s) = 63
--- L1 instruction ---fully associative cache = falsethreads sharing this cache = 0x0 (0)processor cores on this die= 0x3 (3)system coherency line size = 0x3f (63)ways of associativity = 0x7 (7)number of sets - 1 (s) = 63
--- L2 unified cache ---fully associative cache falsethreads sharing this cache = 0x1 (1)processor cores on this die= 0x3 (3)system coherency line size = 0x3f (63)ways of associativity = 0xf (15)number of sets - 1 (s) = 4095
More than you care to know about your CPU:http://www.etallen.com/cpuid.html
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Measuring the Cache I
void go(unsigned count, unsigned stride)
{
  const unsigned arr_size = 64 * 1024 * 1024;
  int *ary = (int *) malloc(sizeof(int) * arr_size);

  for (unsigned it = 0; it < count; ++it)
  {
    for (unsigned i = 0; i < arr_size; i += stride)
      ary[i] *= 17;
  }

  free(ary);
}
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
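A minimal driver for the routine above, as a sketch: it assumes a POSIX clock_gettime() and that go() from the slide is linked in, and simply times a sweep over a few strides so the cache-line effect becomes visible.

/* Sketch of a timing harness for go(); link against the code above. */
#include <stdio.h>
#include <time.h>

void go(unsigned count, unsigned stride);     /* from the slide above */

int main(void)
{
    for (unsigned stride = 1; stride <= 64; stride *= 2)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        go(1, stride);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("stride %2u: %.3f s\n", stride, secs);
    }
    return 0;
}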
Measuring the Cache II
void go(unsigned array_size, unsigned steps)
{
  int *ary = (int *) malloc(sizeof(int) * array_size);
  unsigned asm1 = array_size - 1;

  for (unsigned i = 0; i < steps; ++i)
    ary[(i * 16) & asm1]++;

  free(ary);
}
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Measuring the Cache III
void go(unsigned array_size, unsigned stride, unsigned steps)
{
  char *ary = (char *) malloc(sizeof(int) * array_size);

  unsigned p = 0;
  for (unsigned i = 0; i < steps; ++i)
  {
    ary[p]++;
    p += stride;
    if (p >= array_size)
      p = 0;
  }

  free(ary);
}
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Mike Bauer (Stanford)
Tue 4/5/11: Guest Lecture by Mike Bauer (Stanford)
http://sequoia.stanford.edu/
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
Source of Slowness: Sequential Operation
IF Instruction fetch
ID Instruction Decode
EX Execution
MEM Memory Read/Write
WB Result Writeback
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Solution: Pipelining
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Pipelining
(MIPS, 110,000 transistors)
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Issues with Pipelines
Pipelines generally help performance, but not always.
Possible issues:
• Stalls
• Dependent Instructions
• Branches (+Prediction)
• Self-Modifying Code
“Solution”: Bubbling, extra circuitry
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Intel Q6600 Pipeline
New concept: Instruction-level parallelism (“Superscalar”)
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Programming for the Pipeline
How to upset a processor pipeline:
for (int i = 0; i < 1000; ++i)
  for (int j = 0; j < 1000; ++j)
  {
    if (j % 2 == 0)
      do_something(i, j);
  }
. . . why is this bad?
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
A Puzzle
int steps = 256 * 1024 * 1024;
int a[2];

// Loop 1
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }
Which is faster?
. . . and why?
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
Two useful Strategies
Loop unrolling:
for (int i = 0; i < 1000; ++i)
  do_something(i);
→
for (int i = 0; i < 1000; i += 2)
{
  do_something(i);
  do_something(i + 1);
}

Software pipelining:

for (int i = 0; i < 1000; ++i)
{
  do_a(i);
  do_b(i);
}
→
for (int i = 0; i < 1000; i += 2)
{
  do_a(i);
  do_a(i + 1);
  do_b(i);
  do_b(i + 1);
}
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
SIMD
Control Units are large and expensive.
Functional Units are simple and cheap.
→ Increase the Function/Control ratio:
Control several functional units with
one control unit.
All execute same operation.
[Diagram (SIMD): a single instruction pool drives multiple processing units, each operating on its own element from the data pool.]
GCC vector extensions:
typedef int v4si __attribute__ ((vector_size (16)));
v4si a, b, c;
c = a + b;
// +, -, *, /, unary minus, ^, |, &, ~, %
Will revisit for OpenCL, GPUs.
Intro Basics Assembly Memory Pipelinesadapted from Berger & Klöckner (NYU 2010)
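As a small, self-contained illustration of the vector extension above (a sketch assuming GCC or Clang; element indexing of vector types needs a reasonably recent compiler), the following program adds two 4-wide integer vectors and prints the result:

/* One SIMD add replaces an explicit loop over four ints. */
#include <stdio.h>

typedef int v4si __attribute__ ((vector_size (16)));   /* 4 x 32-bit int */

int main(void)
{
    v4si a = {1, 2, 3, 4};
    v4si b = {10, 20, 30, 40};
    v4si c = a + b;                      /* element-wise addition */

    for (int i = 0; i < 4; ++i)
        printf("%d ", c[i]);             /* per-element access (GCC >= 4.7 / Clang) */
    printf("\n");
    return 0;
}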
Architecture
• What’s in a (basic) computer?
• Basic Subsystems
• Machine Language
• Memory Hierarchy
• Pipelines
• CPUs to GPUs
• FLOPS performance!
(Figure courtesy of Dave Luebke (NVIDIA), CUDA course)
• Designed for math-intensive, parallel problems
• More transistors dedicated to ALU than to flow control and data cache
• What are the consequences?
  • Program must be more predictable:
    • Data access coherency
    • Program flow
GPUs ?
Intro PyOpenCL What and Why? OpenCL
“CPU-style” Cores
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
CPU-“style” cores
ALU (Execute)
Fetch/ Decode
Execution Context
Out-of-order control logic
Fancy branch predictor
Memory pre-fetcher
Data cache (A big one)
13
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDA
Intro PyOpenCL What and Why? OpenCL
Slimming down
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Slimming down
ALU (Execute)
Fetch/ Decode
Execution Context
Idea #1:
Remove components that help a single instruction stream run fast
14
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
More Space: Double the Number of Cores
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Two cores (two fragments in parallel)
ALU (Execute)
Fetch/ Decode
Execution Context
ALU (Execute)
Fetch/ Decode
Execution Context
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)

fragment 1 and fragment 2 each run this same instruction stream, one per core.
15
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
. . . again
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Four cores (four fragments in parallel)
ALU (Execute)
Fetch/ Decode
Execution Context
ALU (Execute)
Fetch/ Decode
Execution Context
ALU (Execute)
Fetch/ Decode
Execution Context
ALU (Execute)
Fetch/ Decode
Execution Context
16
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
. . . and again
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Sixteen cores (sixteen fragments in parallel)
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
16 cores = 16 simultaneous instruction streams 17 Credit: Kayvon Fatahalian (Stanford)
→ 16 independent instruction streams
Reality: instruction streams are not actually very different/independent
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
. . . and again
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Sixteen cores (sixteen fragments in parallel)
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
ALU ALU
16 cores = 16 simultaneous instruction streams 17 Credit: Kayvon Fatahalian (Stanford)
→ 16 independent instruction streams
Reality: instruction streams not actuallyvery different/independent
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
Saving Yet More Space
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Recall: simple processing core
Fetch/ Decode
ALU (Execute)
Execution Context
Idea #2
Amortize cost/complexity of managing an instruction stream across many ALUs
→ SIMD
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
Saving Yet More Space
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Add ALUs
Fetch/ Decode
Idea #2:
Amortize cost/complexity of managing an instruction stream across many ALUs
ALU 1 ALU 2 ALU 3 ALU 4
ALU 5 ALU 6 ALU 7 ALU 8
SIMD processing Ctx Ctx Ctx Ctx
Ctx Ctx Ctx Ctx
Shared Ctx Data
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
Gratuitous Amounts of Parallelism!
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
128 fragments in parallel
= 16 simultaneous instruction streams 16 cores = 128 ALUs
24 Credit: Kayvon Fatahalian (Stanford)
Example:
128 instruction streams in parallel16 independent groups of 8 synchronized streams
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
http://www.youtube.com/watch?v=1yH_j8-VVLo
Intro PyOpenCL What and Why? OpenCL
Remaining Problem: Slow Memory
Problem
Memory still has very high latency ...
... but we've removed most of the hardware that helps us deal with that.
We’ve removed
caches
branch prediction
out-of-order execution
So what now?
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
[Diagram (hiding shader stalls): the core keeps four groups of fragments in flight (Frag 1-8, 9-16, 17-24, 25-32); when one group stalls on memory, the Fetch/Decode unit switches to another group so the ALUs stay busy.]
Idea #3
Even more parallelism+ Some extra memory= A solution!
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
Intro PyOpenCL What and Why? OpenCL
GPU Architecture Summary
Core Ideas:
1. Many slimmed-down cores → lots of parallelism
2. More ALUs, fewer control units
3. Avoid memory stalls by interleaving execution of SIMD groups (“warps”)
Credit: Kayvon Fatahalian (Stanford)
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
• FLOPS performance!
(Figure courtesy of Dave Luebke (NVIDIA), CUDA course)
• Designed for math-intensive, parallel problems
• More transistors dedicated to ALU than to flow control and data cache
• What are the consequences?
  • Program must be more predictable:
    • Data access coherency
    • Program flow
slide by Matthew Bolitho
Is it free?
Some terminology
One way to classify machines distinguishes between:
• shared memory: global memory can be accessed by all processors or cores. Information is exchanged between threads using shared variables written by one thread and read by another. Access to shared variables must be coordinated.
• distributed memory: private memory for each processor, only accessible by that processor, so no synchronization of memory accesses is needed. Information is exchanged by sending data from one processor to another via an interconnection network using explicit communication operations.
[Diagram: “distributed memory” shows processors P, each with its own memory M, joined by an interconnection network; “shared memory” shows processors P joined by an interconnection network to common memory modules M.]
Hybrid approach increasingly common (now: mostly hybrid)
Programming Model (Overview)
GPU Architecture
CUDA Programming Model
Intro PyOpenCL What and Why? OpenCL
Connection: Hardware ↔ Programming Model
[Hardware: ten slimmed-down cores, each with a Fetch/Decode unit, 32 kiB of private context (“registers”), and 16 kiB of shared context.]
Who cares how many cores?
Idea:
• Program as if there were “infinitely” many cores
• Program as if there were “infinitely” many ALUs per core
Consider: Which is easy to do automatically?
Parallel program → sequential hardware
or
Sequential program → parallel hardware?
[Software representation: a Grid (a kernel is a function on a grid) made up of (Work) Groups, also called “Blocks”, laid out along Axis 0 and Axis 1; each group is made up of (Work) Items, also called “Threads”. The software grid is mapped onto the hardware cores.]
Really: the Group provides a pool of parallelism to draw from.
X, Y, Z order within a group matters. (Not among groups, though.)
Grids can be 1-, 2-, or 3-dimensional.
slide by Andreas Klockner, GPU-Python with PyOpenCL and PyCUDA
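To make the grid/group/item mapping concrete, here is a minimal sketch in plain C (not CUDA or OpenCL API calls; the names n, group_size, num_groups, local, global are illustrative only) of how a 1-D range of work decomposes into groups of items, with the global index recovered from (group, item):

/* Decompose n work-items into ceil(n / group_size) groups. */
#include <stdio.h>

int main(void)
{
    const int n = 20;              /* total work-items needed   */
    const int group_size = 8;      /* items per group ("block") */
    const int num_groups = (n + group_size - 1) / group_size;   /* the grid */

    for (int group = 0; group < num_groups; ++group)       /* over the grid  */
        for (int local = 0; local < group_size; ++local)    /* within a group */
        {
            int global = group * group_size + local;        /* item's index   */
            if (global < n)                                 /* guard the tail */
                printf("group %d, item %d -> element %d\n", group, local, global);
        }
    return 0;
}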
Axis 0
Axis1
HardwareSoftware representation
?
Really: Group providespool of parallelism to drawfrom.
X,Y,Z order within groupmatters. (Not amonggroups, though.)
Grids can be 1,2,3-dimensional.
Andreas Klockner GPU-Python with PyOpenCL and PyCUDAslide by
more next time ;-)
Bits of Theory (or “common sense”)
Speedup
• T(1): Performance of best serial algorithm
• p: Number of processors
• T(p): execution time of the parallel algorithm on p processors
• S(p) = T(1) / T(p)
• S(p) ≤ p
Peter Arbenz, Andreas Adelmann, ETH Zurich
Efficiency
• Fraction of time for which a processor does useful work
• E(p) = S(p) / p = T(1) / ( p T(p) )
• E(p) ≤ 1 means S(p) ≤ p
Peter Arbenz, Andreas Adelmann, ETH Zurich
Amdahl’s Law
Peter Arbenz, Andreas Adelmann, ETH Zurich
• α: Fraction of the program that is sequential
• Assumes that the non-sequential portion of the program parallelizes optimally
• T(p) = ( α + (1 − α)/p ) T(1)
Example
• Sequential portion: 10 sec
• Parallel portion: 990 sec
• What is the maximal speedup as p → ∞?

Solution
• Sequential fraction of the code: α = 10 / (10 + 990) = 1/100 = 1%
• Amdahl’s Law: T(p) = ( 0.01 + 0.99/p ) T(1)
• Speedup as p → ∞: S(p) = T(1) / T(p) → 1/α = 100
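As a quick check of these numbers, a small Python sketch (mine, not from the course materials); amdahl_speedup is an illustrative helper name.

# Minimal sketch: evaluate Amdahl's law S(p) = 1 / (alpha + (1 - alpha)/p).
def amdahl_speedup(alpha, p):
    """Speedup on p processors when a fraction alpha of the work is sequential."""
    return 1.0 / (alpha + (1.0 - alpha) / p)

alpha = 10.0 / (10.0 + 990.0)            # sequential fraction = 1%
for p in (1, 10, 100, 1000, 10**6):
    print(p, round(amdahl_speedup(alpha, p), 1))
# As p grows, the speedup approaches 1/alpha = 100.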
Arithmetic Intensity
• W: computational Work in floating-point operations
• M: number of Memory accesses (reads and writes)
• Memory access is the critical issue!
Example 4.1 Memory effects
Memory access is the critical issue in high-performance computing.

Definition 4.2 The work/memory ratio ρWM: the number of floating-point operations
divided by the number of memory locations referenced (either reads or writes).

A look at a book of mathematical tables tells us that

  π/4 = 1 − 1/3 + 1/5 − 1/7 + 1/9 − 1/11 + 1/13 − 1/15 + · · ·    (4.1)

This slowly converging series is a good example for studying the basic operation of
computing the sum of a series of numbers:

  A = Σ_{i=1..N} a_i.    (4.2)

Computation of A in equation (4.2) requires N − 1 floating-point additions and
involves N + 1 memory locations: one for A and N for the a_i’s.
Therefore, the work/memory ratio for this algorithm is ρWM = (N − 1)/(N + 1) ≈ 1
for large N.
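A small Python illustration (not from the book) of the summation in (4.1)–(4.2): summing N terms performs about N − 1 additions over about N + 1 memory locations, so the work/memory ratio is close to 1.

# Illustrative sketch: sum N terms of the series pi/4 = 1 - 1/3 + 1/5 - ...
N = 1_000_000
terms = [(-1.0) ** i / (2 * i + 1) for i in range(N)]   # the a_i's
A = 0.0
for a in terms:                       # roughly N additions over N + 1 locations
    A += a
work_memory_ratio = (N - 1) / (N + 1)
print(4 * A, work_memory_ratio)       # ~3.14159..., ratio ~ 1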
[Figure 9: Hypothetical performance of a parallel implementation of summation: speed-up vs. number of processors.]
Why?
from Scott et al. “Scientific Parallel Computing” (2005)
[Figure 10: Hypothetical performance of a parallel implementation of summation: parallel efficiency vs. number of processors.]
Why?
from Scott et al. “Scientific Parallel Computing” (2005)
Example
Computation done here → Pathway to memory → Main data stored here

Figure 4: A simple memory model with a computational unit with only a small
amount of local memory (not shown), separated from the main memory by a
pathway with limited bandwidth µ.

Theorem 4.1 Suppose that a given algorithm has a work/memory ratio ρWM, and
it is implemented on a system as depicted in Figure 4 with a maximum bandwidth
to memory of µ billion floating-point words per second. Then the maximum
performance that can be achieved is µ ρWM GFLOPS.

Theorem 4.1 provides an upper bound on the number of operations per unit time,
by assuming that a floating-point operation blocks until its data are available to
the CPU. Therefore the CPU cannot proceed faster than the rate at which data
are supplied, and it might proceed slower.

• Bandwidth = 1 Gbyte / sec
• Q: How many float32 ops / sec maximum?
• The processing unit can’t be faster than the rate at which data are supplied, and it might be slower
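A back-of-the-envelope answer, as a sketch assuming 4-byte float32 words and the bound of Theorem 4.1:

# Assumptions: 1 Gbyte/s pathway, 4-byte float32 words, work/memory ratio ~ 1
# (the summation example above).
bandwidth_bytes = 1e9
word_bytes = 4
rho_wm = 1.0
mu = bandwidth_bytes / word_bytes     # 0.25e9 words/s delivered to the unit
max_flops = mu * rho_wm
print(max_flops)                      # ~2.5e8 float32 ops/s, i.e. ~0.25 GFLOPS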
Better?
Computation done here → Local data cache here → Pathway to memory → Main data stored here

Figure 5: A memory model with a large local data cache separated from the main
memory by a pathway with limited bandwidth µ.

The performance of a two-level memory model (as depicted in Figure 5),
consisting of a cache and a main memory, can be modeled simplistically as

  average cycles per word access
    = %hits × (cache cycles per word access)
    + (1 − %hits) × (main memory cycles per word access),    (4.3)

where %hits is the fraction of cache hits among all memory references.

Figure 6 indicates the performance of a hypothetical application, depicting a
decrease in performance as a problem increases in size and migrates into ever
slower memory systems. Eventually the problem size reaches a point where it
cannot ever be completed for lack of memory.
• Yes? In theory... Why?
• No? Why?
from Scott et al. “Scientific Parallel Computing” (2005)
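A tiny Python sketch of equation (4.3), with made-up cycle counts, just to show how strongly the hit rate drives the average access cost:

# Hypothetical numbers: 2 cycles per cached word, 200 cycles per main-memory word.
def avg_cycles_per_word(hit_fraction, cache_cycles=2, memory_cycles=200):
    return hit_fraction * cache_cycles + (1.0 - hit_fraction) * memory_cycles

for hits in (0.50, 0.90, 0.99):
    print(hits, avg_cycles_per_word(hits))
# 0.5 -> 101.0, 0.9 -> 21.8, 0.99 -> 3.98 cycles per word access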
Cache Performance
T        = total execution time
Tcycle   = time for a single processor cycle
Icount   = total number of instructions
IALU     = number of ALU instructions (e.g. register + register)
IMEM     = number of memory access instructions (e.g. load, store)
CPI      = average cycles per instruction
CPIALU   = average cycles per ALU instruction
CPIMEM   = average cycles per memory instruction
rmiss    = cache miss rate
rhit     = cache hit rate
CPIMEM_MISS = cycles per cache miss
CPIMEM_HIT  = cycles per cache hit
MALU     = instruction mix for ALU instructions
MMEM     = instruction mix for memory access instructions
from V. Sarkar (COMP 322, 2009)

Cache Performance: Example
from V. Sarkar (COMP 322, 2009)
Parallel Complexity
Algorithmic Complexity Measures

TP = execution time on P processors

Computation graph abstraction (DAG):
• Node: arbitrary sequential computation
• Edge: dependence (the successor node can only execute after the predecessor node has completed)

Processor abstraction:
• P identical processors (PROC0 … PROCP-1), each executing one node at a time

T1 = work (“work complexity”): the total number of operations performed

T∞ = span* (“step complexity”): the minimum number of steps with unboundedly many processors
* also called critical-path length or computational depth

Lower bounds:
• TP ≥ T1 / P
• TP ≥ T∞

Parallelism (i.e. ideal speed-up): T1 / T∞

Speedup
If T1/TP = Θ(P), we have linear speedup; if T1/TP = P, we have perfect linear speedup; if T1/TP > P, we have superlinear speedup.
Superlinear speedup is not possible in this model because of the lower bound TP ≥ T1/P, but superlinear speedup can be possible in practice (as we will see later in the course).

adapted from V. Sarkar (COMP 322, 2009)
Example — Array Sum: Sequential Version

• Problem: compute the sum of the elements X[0] … X[n-1] of array X
• Sequential algorithm:
    sum = 0; for ( i = 0; i < n; i++ ) sum += X[i];
• Computation graph: a chain of + nodes, 0 + X[0] + X[1] + X[2] + …
• Work = O(n), Span = O(n), Parallelism = O(1)
• How can we design an algorithm (computation graph) with more parallelism?

adapted from V. Sarkar (COMP 322, 2009)
Example — Array Sum: Parallel Iterative Version

• Computation graph for n = 8: pairwise sums X[0]+X[1], X[2]+X[3], X[4]+X[5], X[6]+X[7], combined level by level
• Work = O(n), Span = O(log n), Parallelism = O( n / log n )
• Extra dependence edges due to the forall construct

adapted from V. Sarkar (COMP 322, 2009)
Example — Array Sum: Parallel Recursive Version

• Computation graph for n = 8: recursively sum the two halves, then add the results
• Work = O(n), Span = O(log n), Parallelism = O( n / log n )
• No extra dependences as in the forall case

adapted from V. Sarkar (COMP 322, 2009)
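A plain (sequential) Python sketch of the recursive computation graph above; the two half-sums at each level are independent, which is exactly what a parallel runtime would exploit.

# Sketch of the parallel recursive array sum: work O(n), span O(log n).
def array_sum(X, lo, hi):
    if hi - lo == 1:
        return X[lo]
    mid = (lo + hi) // 2
    left = array_sum(X, lo, mid)      # these two calls are independent and
    right = array_sum(X, mid, hi)     # could run in parallel
    return left + right

X = list(range(8))
print(array_sum(X, 0, len(X)))        # 28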
Patterns
Task vs Data Parallelism
Task parallelism
• Distribute the tasks across processors based on dependency
• Coarse-grain parallelism
[Figure: a task dependency graph (Tasks 1–9) and its assignment across 3 processors (P1, P2, P3) over time]

Data parallelism
• Run a single kernel over many elements
  – Each element is independently updated
  – Same operation is applied on each element
• Fine-grain parallelism
  – Many lightweight threads, easy to switch context
  – Maps well to an ALU-heavy architecture: the GPU
[Figure: one kernel applied to data elements across processors P1 … Pn]
Task vs. Data parallelism
• Task parallel
  – Independent processes with little communication
  – Easy to use
  – “Free” on modern operating systems with SMP
• Data parallel
  – Lots of data on which the same computation is being executed
  – No dependencies between data elements in each step of the computation
  – Can saturate many ALUs
  – But often requires redesign of traditional algorithms
slide by Mike Houston
CPU vs. GPU
• CPU
– Really fast caches (great for data reuse)
– Fine branching granularity
– Lots of different processes/threads
– High performance on a single thread of execution
• GPU
– Lots of math units
– Fast access to onboard memory
– Run a program on each fragment/vertex
– High throughput on parallel tasks
• CPUs are great for task parallelism
• GPUs are great for data parallelism
slide by Mike Houston
GPU-friendly Problems
• Data-parallel processing
• High arithmetic intensity
  – Keep the GPU busy all the time
  – Computation offsets memory latency
• Coherent data access
  – Access large chunks of contiguous memory
  – Exploit fast on-chip shared memory
The Algorithm Matters
• Jacobi: Parallelizable (reads only the previous iterate v_old, writes v_new)

    for (int i = 1; i < num - 1; i++) {
        v_new[i] = (v_old[i-1] + v_old[i+1]) / 2.0;
    }

• Gauss-Seidel: Difficult to parallelize (reads values already updated in the same sweep)

    for (int i = 1; i < num - 1; i++) {
        v[i] = (v[i-1] + v[i+1]) / 2.0;
    }
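A NumPy sketch (illustrative names, not from the slides) of why the two loops differ: the Jacobi update reads only the old iterate and can be applied to all elements at once, while Gauss-Seidel reads values written earlier in the same sweep.

import numpy as np

v_old = np.linspace(0.0, 1.0, 10)

# Jacobi: every element depends only on the previous iterate -> data parallel.
v_new = v_old.copy()
v_new[1:-1] = 0.5 * (v_old[:-2] + v_old[2:])

# Gauss-Seidel: each update reads v[i-1] already updated in this sweep -> sequential.
v = v_old.copy()
for i in range(1, len(v) - 1):
    v[i] = 0.5 * (v[i - 1] + v[i + 1])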
Example: Reduction
• Serial version (O(N))

    for (int i = 1; i < N; i++) {
        v[0] += v[i];
    }

• Parallel version (O(log N))

    width = N / 2;
    while (width >= 1) {
        for (int i = 0; i < width; i++) {
            v[i] += v[i + width];   // computed in parallel
        }
        width /= 2;
    }
The Importance of Data Parallelism for GPUs
• GPUs are designed for highly parallel tasks like rendering
• GPUs process independent vertices and fragments
  – Temporary registers are zeroed
  – No shared or static data
  – No read-modify-write buffers
  – In short, no communication between vertices or fragments
• Data-parallel processing
  – GPU architectures are ALU-heavy
    • Multiple vertex & pixel pipelines
    • Lots of compute power
  – GPU memory systems are designed to stream data
    • Linear access patterns can be prefetched
    • Hide memory latency
slide by Mike Houston
GPUs
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F
!"!# !"$#
$"!# $"$#
!%&'() $*(+%,()
!%&'()
$*(+%,()
#-+-
"&.+/*0+%1& !"!# !"$#
$"!# $"$#
!%&'() $*(+%,()
!%&'()
$*(+%,()
#-+-
"&.+/*0+%1&
!"!# !"$#
$"!# $"$#
!%&'() $*(+%,()
!%&'()
$*(+%,()
#-+-
"&.+/*0+%1& !"!# !"$#
$"!# $"$#
!%&'() $*(+%,()
!%&'()
$*(+%,()
#-+-
"&.+/*0+%1&
!"!# !"$#
$"!# $"$#
!%&'() $*(+%,()
!%&'()
$*(+%,()
"&.+/*0+%1&
"&.+/*0+%1&
$"$#
!"#$%&'(%)*$+
!(, -.(/
0123$1453%&'(%)*$+
(,, 67523%$2
8+4$1& 9$1&
slide by Matthew Bolitho
Flynn’s Taxonomy
Early classification of parallel computing architectures given by M. Flynn (1972), using the number of instruction streams and data streams. Still used.
• Single Instruction Single Data (SISD): conventional sequential computer with one processor, single program and data storage.
• Multiple Instruction Single Data (MISD): used for fault tolerance (Space Shuttle) - from Wikipedia
• Single Instruction Multiple Data (SIMD): each processing element uses the same instruction, applied synchronously in parallel to different data elements (Connection Machine, GPUs). If-then-else statements take two steps to execute.
• Multiple Instruction Multiple Data (MIMD): each processing element loads separate instructions and separate data elements; processors work asynchronously. Since 2006 the top ten supercomputers have been of this type (w/o the 10K-node SGI Altix Columbia at NASA Ames).
Update: Single Program Multiple Data (SPMD): autonomous processors executing the same program but not in lockstep. Most common style of programming.
adapted from Berger & Klöckner (NYU 2010)
Finding Concurrency
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
&
! !"#$%&'()'*$'+&',)($'',%$)-."#)$/.&0/1)
0#$."2'#'3&
! 45)$."$".&0"3)"-)$."6./#)&7/&)0()$/./11'1
! 85)($'',%$)"-)$/./11'1)$".&0"3)
! 9)('.0/1)/16".0&7#)+/3):')#/,')$/./11'1):;)
!"#$"#%&'!"#$%&'()$!*+%,+-..!,+/0
! <03,)-%3,/#'3&/1)$/.&()"-)&7')/16".0&7#)
&7/&)/.')('$/./:1'
slide by Matthew Bolitho
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
! 896)0,-5*#%(".%:'%3'()*+)#'3%:7%:)-5%-"#$%
".3%3"-";
! !"#$;%<,.3%60)1+#%)=%,.#-01(-,).#%-5"-%(".%:'%
'>'(1-'3%,.%+"0"99'9
! %"&";%<,.3%+"0-,-,).#%,.%-5'%3"-"%-5"-%(".%:'%1#'3%
?0'9"-,@'97A%,.3'+'.3'.-97
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
! 8."97B'%-5'%"96)0,-5*%".3%=,.3%!"#$%&'#('
)*&+"$,+)#*& -5"-%"0'%?0'9"-,@'97A%,.3'+'.3'.-
! 8."97B'%-5'%"96)0,-5*%".3%=,.3%!"#$%&'#('
)*&+"$,+)#*& -5"-%"0'%?0'9"-,@'97A%,.3'+'.3'.-
! C6;%D"-0,>%D19-,+9,("-,).
! E)*+1-,.6%'"(5%'9'*'.-%)=%F%,#%"%3)-%+0)31(-
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
! 896)0,-5*#%(".%:'%3'()*+)#'3%:7%:)-5%-"#$%
".3%3"-";
! !"#$;%<,.3%60)1+#%)=%,.#-01(-,).#%-5"-%(".%:'%
'>'(1-'3%,.%+"0"99'9
! %"&";%<,.3%+"0-,-,).#%,.%-5'%3"-"%-5"-%(".%:'%1#'3%
?0'9"-,@'97A%,.3'+'.3'.-97
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
! 8."97B'%-5'%"96)0,-5*%".3%=,.3%!"#$%&'#('
)*&+"$,+)#*& -5"-%"0'%?0'9"-,@'97A%,.3'+'.3'.-
! 8."97B'%-5'%"96)0,-5*%".3%=,.3%!"#$%&'#('
)*&+"$,+)#*& -5"-%"0'%?0'9"-,@'97A%,.3'+'.3'.-
! C6;%D"-0,>%D19-,+9,("-,).
! E)*+1-,.6%'"(5%'9'*'.-%)=%F%,#%"%3)-%+0)31(-
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
! 896)0,-5*#%(".%:'%3'()*+)#'3%:7%:)-5%-"#$%
".3%3"-";
! !"#$;%<,.3%60)1+#%)=%,.#-01(-,).#%-5"-%(".%:'%
'>'(1-'3%,.%+"0"99'9
! %"&";%<,.3%+"0-,-,).#%,.%-5'%3"-"%-5"-%(".%:'%1#'3%
?0'9"-,@'97A%,.3'+'.3'.-97
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,#
! 8."97B'%-5'%"96)0,-5*%".3%=,.3%!"#$%&'#('
)*&+"$,+)#*& -5"-%"0'%?0'9"-,@'97A%,.3'+'.3'.-
! 8."97B'%-5'%"96)0,-5*%".3%=,.3%!"#$%&'#('
)*&+"$,+)#*& -5"-%"0'%?0'9"-,@'97A%,.3'+'.3'.-
! C6;%D"-0,>%D19-,+9,("-,).
! E)*+1-,.6%'"(5%'9'*'.-%)=%F%,#%"%3)-%+0)31(-
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
'
! !"#$%&'()*'(#$+,-.)*/(#"0(1."0(!"#$%&'#('
)*&+"$,+)#*& )*#)(#-'(2-'$#).3'$%4(."0'5'"0'")
! 6+7(8,$'9:$#-(;%"#/.9<! =,/5:)'>.?-#).,"#$@,-9'<
! =,/5:)'A,)#).,"#$@,-9'<
! =,/5:)';.*'0-#$@,-9'<
! =,/5:)'B'.+*?,:-<
! =,/5:)'B,"C,"0."+@,-9'<
! D50#)'E,<.).,"<!"0>'$,9.).'<
F#<G(;'9,/5,<.).,"
;#)#(;'9,/5,<.).,"
H-,:5(F#<G<
I-0'-(F#<G<
;#)#(J*#-."+
;'9,/5,<.).," ;'5'"0'"9%(!"#$%<.<
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(=,$:/"<(#"0(A,K<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(C$,9G<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
! @."0(K#%<(),(5#-).).,"()*'(0#)#
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
'
! !"#$%&'()*'(#$+,-.)*/(#"0(1."0(!"#$%&'#('
)*&+"$,+)#*& )*#)(#-'(2-'$#).3'$%4(."0'5'"0'")
! 6+7(8,$'9:$#-(;%"#/.9<! =,/5:)'>.?-#).,"#$@,-9'<
! =,/5:)'A,)#).,"#$@,-9'<
! =,/5:)';.*'0-#$@,-9'<
! =,/5:)'B'.+*?,:-<
! =,/5:)'B,"C,"0."+@,-9'<
! D50#)'E,<.).,"<!"0>'$,9.).'<
F#<G(;'9,/5,<.).,"
;#)#(;'9,/5,<.).,"
H-,:5(F#<G<
I-0'-(F#<G<
;#)#(J*#-."+
;'9,/5,<.).," ;'5'"0'"9%(!"#$%<.<
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(=,$:/"<(#"0(A,K<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(C$,9G<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
! @."0(K#%<(),(5#-).).,"()*'(0#)#
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
'
! !"#$%&'()*'(#$+,-.)*/(#"0(1."0(!"#$%&'#('
)*&+"$,+)#*& )*#)(#-'(2-'$#).3'$%4(."0'5'"0'")
! 6+7(8,$'9:$#-(;%"#/.9<! =,/5:)'>.?-#).,"#$@,-9'<
! =,/5:)'A,)#).,"#$@,-9'<
! =,/5:)';.*'0-#$@,-9'<
! =,/5:)'B'.+*?,:-<
! =,/5:)'B,"C,"0."+@,-9'<
! D50#)'E,<.).,"<!"0>'$,9.).'<
F#<G(;'9,/5,<.).,"
;#)#(;'9,/5,<.).,"
H-,:5(F#<G<
I-0'-(F#<G<
;#)#(J*#-."+
;'9,/5,<.).," ;'5'"0'"9%(!"#$%<.<
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(=,$:/"<(#"0(A,K<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(C$,9G<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
! @."0(K#%<(),(5#-).).,"()*'(0#)#
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
'
! !"#$%&'()*'(#$+,-.)*/(#"0(1."0(!"#$%&'#('
)*&+"$,+)#*& )*#)(#-'(2-'$#).3'$%4(."0'5'"0'")
! 6+7(8,$'9:$#-(;%"#/.9<! =,/5:)'>.?-#).,"#$@,-9'<
! =,/5:)'A,)#).,"#$@,-9'<
! =,/5:)';.*'0-#$@,-9'<
! =,/5:)'B'.+*?,:-<
! =,/5:)'B,"C,"0."+@,-9'<
! D50#)'E,<.).,"<!"0>'$,9.).'<
F#<G(;'9,/5,<.).,"
;#)#(;'9,/5,<.).,"
H-,:5(F#<G<
I-0'-(F#<G<
;#)#(J*#-."+
;'9,/5,<.).," ;'5'"0'"9%(!"#$%<.<
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(=,$:/"<(#"0(A,K<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"7(C$,9G<
1 2
! !"#$%&'()*'(#$+,-.)*/(),(1."0(K#%<(),(
%-"+)+)#*'+./'0-+-
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
! @."0(K#%<(),(5#-).).,"()*'(0#)#
! 6+7(8#)-.L(8:$).5$.9#).,"
1 2
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
0
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !5'0'%"0'%*".7%:"7#%-)%3'()*+)#'%".7%
6,;'.%"96)0,-5*
! 4)*'-,*'#%3"-"%3'()*+)#'%'"#,97
! 4)*'-,*'#%-"#$#%3'()*+)#'%'"#,97
! 4)*'-,*'#%<)-5=! 4)*'-,*'#%.',-5'0=
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! 2.('%-5'%"96)0,-5*%5"#%<''.%3'()*+)#'3%
,.-)%3"-"%".3%-"#$#>
!8."97?' @.-'0"(-,).#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !)%'"#'%-5'%*"."6'*'.-%)A%3'+'.3'.(,'#%
A,.3%-"#$#%-5"-%"0'%#,*,9"0%".3%60)1+%-5'*
! !5'.%"."97?'%().#-0",.-#%-)%3'-'0*,.'%".7%
.'('##"07%)03'0
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
0
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !5'0'%"0'%*".7%:"7#%-)%3'()*+)#'%".7%
6,;'.%"96)0,-5*
! 4)*'-,*'#%3"-"%3'()*+)#'%'"#,97
! 4)*'-,*'#%-"#$#%3'()*+)#'%'"#,97
! 4)*'-,*'#%<)-5=! 4)*'-,*'#%.',-5'0=
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! 2.('%-5'%"96)0,-5*%5"#%<''.%3'()*+)#'3%
,.-)%3"-"%".3%-"#$#>
!8."97?' @.-'0"(-,).#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !)%'"#'%-5'%*"."6'*'.-%)A%3'+'.3'.(,'#%
A,.3%-"#$#%-5"-%"0'%#,*,9"0%".3%60)1+%-5'*
! !5'.%"."97?'%().#-0",.-#%-)%3'-'0*,.'%".7%
.'('##"07%)03'0
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
0
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !5'0'%"0'%*".7%:"7#%-)%3'()*+)#'%".7%
6,;'.%"96)0,-5*
! 4)*'-,*'#%3"-"%3'()*+)#'%'"#,97
! 4)*'-,*'#%-"#$#%3'()*+)#'%'"#,97
! 4)*'-,*'#%<)-5=! 4)*'-,*'#%.',-5'0=
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! 2.('%-5'%"96)0,-5*%5"#%<''.%3'()*+)#'3%
,.-)%3"-"%".3%-"#$#>
!8."97?' @.-'0"(-,).#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !)%'"#'%-5'%*"."6'*'.-%)A%3'+'.3'.(,'#%
A,.3%-"#$#%-5"-%"0'%#,*,9"0%".3%60)1+%-5'*
! !5'.%"."97?'%().#-0",.-#%-)%3'-'0*,.'%".7%
.'('##"07%)03'0
slide by Matthew Bolitho
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
0
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !5'0'%"0'%*".7%:"7#%-)%3'()*+)#'%".7%
6,;'.%"96)0,-5*
! 4)*'-,*'#%3"-"%3'()*+)#'%'"#,97
! 4)*'-,*'#%-"#$#%3'()*+)#'%'"#,97
! 4)*'-,*'#%<)-5=! 4)*'-,*'#%.',-5'0=
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! 2.('%-5'%"96)0,-5*%5"#%<''.%3'()*+)#'3%
,.-)%3"-"%".3%-"#$#>
!8."97?' @.-'0"(-,).#
!"#$%&'()*+)#,-,).
&"-"%&'()*+)#,-,).
/0)1+%!"#$#
203'0%!"#$#
&"-"%45"0,.6
&'()*+)#,-,). &'+'.3'.(7%8."97#,# ! !)%'"#'%-5'%*"."6'*'.-%)A%3'+'.3'.(,'#%
A,.3%-"#$#%-5"-%"0'%#,*,9"0%".3%60)1+%-5'*
! !5'.%"."97?'%().#-0",.-#%-)%3'-'0*,.'%".7%
.'('##"07%)03'0
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! :").4'$#@"*-$-#="2/$&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$#?$0+(<"42&! :").4'$#?"*D@"*-0*+#="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
A.-%'$#B"&0'0"*&#%*-#;$3"/0'0$&
?"*#@"*-$-#="2/$&
?$0+(<"2#H0&'
@"*-$-#="2/$&
!%&1#8$/")."&0'0"*
8%'%#8$/")."&0'0"*
I2"4.#!%&1&
E2-$2#!%&1&
8%'%#J(%20*+
8$/")."&0'0"* 8$.$*-$*/9#C*%39&0&
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! :").4'$#@"*-$-#="2/$&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$#?$0+(<"42&! :").4'$#?"*D@"*-0*+#="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
A.-%'$#B"&0'0"*&#%*-#;$3"/0'0$&
?"*#@"*-$-#="2/$&
?$0+(<"2#H0&'
@"*-$-#="2/$&
!%&1#8$/")."&0'0"*
8%'%#8$/")."&0'0"*
I2"4.#!%&1&
E2-$2#!%&1&
8%'%#J(%20*+
8$/")."&0'0"* 8$.$*-$*/9#C*%39&0&
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! :").4'$#@"*-$-#="2/$&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$#?$0+(<"42&! :").4'$#?"*D@"*-0*+#="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
A.-%'$#B"&0'0"*&#%*-#;$3"/0'0$&
?"*#@"*-$-#="2/$&
?$0+(<"2#H0&'
@"*-$-#="2/$&
!%&1#8$/")."&0'0"*
8%'%#8$/")."&0'0"*
I2"4.#!%&1&
E2-$2#!%&1&
8%'%#J(%20*+
8$/")."&0'0"* 8$.$*-$*/9#C*%39&0&
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! !"#$%&$#'($#)%*%+$)$*'#",#-$.$*-$*/0$&#,0*-#'%&1&#'(%'#%2$#&0)03%2#%*-#+2"4.#'($)
! 5+6#7"3$/43%2#89*%)0/&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$?$0+(<"42&
! :").4'$?"*@"*-0*+="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! :").4'$#@"*-$-#="2/$&! :").4'$;0<2%'0"*%3="2/$&
! :").4'$>"'%'0"*%3="2/$&
! :").4'$80($-2%3="2/$&
! :").4'$#?$0+(<"42&! :").4'$#?"*D@"*-0*+#="2/$&
! A.-%'$B"&0'0"*&C*-;$3"/0'0$&
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
! E*/$#+2"4.&#",#'%&1&#%2$#0-$*'0,0$-F#-%'%#,3"G#
/"*&'2%0*'&#$*,"2/$#%#.%2'0%3#"2-$26
A.-%'$#B"&0'0"*&#%*-#;$3"/0'0$&
?"*#@"*-$-#="2/$&
?$0+(<"2#H0&'
@"*-$-#="2/$&
!%&1#8$/")."&0'0"*
8%'%#8$/")."&0'0"*
I2"4.#!%&1&
E2-$2#!%&1&
8%'%#J(%20*+
8$/")."&0'0"* 8$.$*-$*/9#C*%39&0&
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F$
! !"#$%&'()*'++,%-(.$($.%/(-0&1%-2%)'131%'".%
&'()*)*-"1%-2%.')'%'($%*.$")*2*$.4%'"'+,5$%)6$%
!"#"$%&"'()*$)6')%-##0(1
! 7')'%16'(*"/%#'"%8$%#')$/-(*5$.%'19
! :$'.;-"+,
! <22$#)*=$+,%>-#'+
! :$'.;?(*)$
! @##0A0+')$
! B0+)*&+$%:$'.CD*"/+$%?(*)$
+,"!-.)/0
! 7')'%*1%($'.4%80)%"-)%E(*))$"
! F-%#-"1*1)$"#,%&(-8+$A1
! :$&+*#')*-"%*"%.*1)(*80)$.%1,1)$A
122,3#(4,/0-5.3"/
! 7')'%*1%($'.%'".%E(*))$"
! 7')'%*1%&'()*)*-"$.%*")-%1081$)1
! !"$%)'13%&$(%1081$)
! G'"%.*1)(*80)$%1081$)1
+,"!-6'(#,
! 7')'%*1%($'.%'".%E(*))$"
! B'",%)'131%'##$11%A'",%.')'
! G-"1*1)$"#,%*110$1
! B-1)%.*22*#0+)%)-%.$'+%E*)6
+,"!-6'(#,$!733898/"#(.)%
! @1%&$(%:$'.;?(*)$4%'+)6-0/6%E(*)$1%#-"1*1)%-2%'"%
'##0A0+')*-"%-&$(')*-"
! G-AA-"%*"%($.0#)*-";),&$%'+/-(*)6A1
! G'"%($&+*#')$%1*"#$%'##0A0+')*-"%#'"%8$%+*"$'(
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F$
! !"#$%&'()*'++,%-(.$($.%/(-0&1%-2%)'131%'".%
&'()*)*-"1%-2%.')'%'($%*.$")*2*$.4%'"'+,5$%)6$%
!"#"$%&"'()*$)6')%-##0(1
! 7')'%16'(*"/%#'"%8$%#')$/-(*5$.%'19
! :$'.;-"+,
! <22$#)*=$+,%>-#'+
! :$'.;?(*)$
! @##0A0+')$
! B0+)*&+$%:$'.CD*"/+$%?(*)$
+,"!-.)/0
! 7')'%*1%($'.4%80)%"-)%E(*))$"
! F-%#-"1*1)$"#,%&(-8+$A1
! :$&+*#')*-"%*"%.*1)(*80)$.%1,1)$A
122,3#(4,/0-5.3"/
! 7')'%*1%($'.%'".%E(*))$"
! 7')'%*1%&'()*)*-"$.%*")-%1081$)1
! !"$%)'13%&$(%1081$)
! G'"%.*1)(*80)$%1081$)1
+,"!-6'(#,
! 7')'%*1%($'.%'".%E(*))$"
! B'",%)'131%'##$11%A'",%.')'
! G-"1*1)$"#,%*110$1
! B-1)%.*22*#0+)%)-%.$'+%E*)6
+,"!-6'(#,$!733898/"#(.)%
! @1%&$(%:$'.;?(*)$4%'+)6-0/6%E(*)$1%#-"1*1)%-2%'"%
'##0A0+')*-"%-&$(')*-"
! G-AA-"%*"%($.0#)*-";),&$%'+/-(*)6A1
! G'"%($&+*#')$%1*"#$%'##0A0+')*-"%#'"%8$%+*"$'(
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F$
! !"#$%&'()*'++,%-(.$($.%/(-0&1%-2%)'131%'".%
&'()*)*-"1%-2%.')'%'($%*.$")*2*$.4%'"'+,5$%)6$%
!"#"$%&"'()*$)6')%-##0(1
! 7')'%16'(*"/%#'"%8$%#')$/-(*5$.%'19
! :$'.;-"+,
! <22$#)*=$+,%>-#'+
! :$'.;?(*)$
! @##0A0+')$
! B0+)*&+$%:$'.CD*"/+$%?(*)$
+,"!-.)/0
! 7')'%*1%($'.4%80)%"-)%E(*))$"
! F-%#-"1*1)$"#,%&(-8+$A1
! :$&+*#')*-"%*"%.*1)(*80)$.%1,1)$A
122,3#(4,/0-5.3"/
! 7')'%*1%($'.%'".%E(*))$"
! 7')'%*1%&'()*)*-"$.%*")-%1081$)1
! !"$%)'13%&$(%1081$)
! G'"%.*1)(*80)$%1081$)1
+,"!-6'(#,
! 7')'%*1%($'.%'".%E(*))$"
! B'",%)'131%'##$11%A'",%.')'
! G-"1*1)$"#,%*110$1
! B-1)%.*22*#0+)%)-%.$'+%E*)6
+,"!-6'(#,$!733898/"#(.)%
! @1%&$(%:$'.;?(*)$4%'+)6-0/6%E(*)$1%#-"1*1)%-2%'"%
'##0A0+')*-"%-&$(')*-"
! G-AA-"%*"%($.0#)*-";),&$%'+/-(*)6A1
! G'"%($&+*#')$%1*"#$%'##0A0+')*-"%#'"%8$%+*"$'(
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F$
! !"#$%&'()*'++,%-(.$($.%/(-0&1%-2%)'131%'".%
&'()*)*-"1%-2%.')'%'($%*.$")*2*$.4%'"'+,5$%)6$%
!"#"$%&"'()*$)6')%-##0(1
! 7')'%16'(*"/%#'"%8$%#')$/-(*5$.%'19
! :$'.;-"+,
! <22$#)*=$+,%>-#'+
! :$'.;?(*)$
! @##0A0+')$
! B0+)*&+$%:$'.CD*"/+$%?(*)$
+,"!-.)/0
! 7')'%*1%($'.4%80)%"-)%E(*))$"
! F-%#-"1*1)$"#,%&(-8+$A1
! :$&+*#')*-"%*"%.*1)(*80)$.%1,1)$A
122,3#(4,/0-5.3"/
! 7')'%*1%($'.%'".%E(*))$"
! 7')'%*1%&'()*)*-"$.%*")-%1081$)1
! !"$%)'13%&$(%1081$)
! G'"%.*1)(*80)$%1081$)1
+,"!-6'(#,
! 7')'%*1%($'.%'".%E(*))$"
! B'",%)'131%'##$11%A'",%.')'
! G-"1*1)$"#,%*110$1
! B-1)%.*22*#0+)%)-%.$'+%E*)6
+,"!-6'(#,$!733898/"#(.)%
! @1%&$(%:$'.;?(*)$4%'+)6-0/6%E(*)$1%#-"1*1)%-2%'"%
'##0A0+')*-"%-&$(')*-"
! G-AA-"%*"%($.0#)*-";),&$%'+/-(*)6A1
! G'"%($&+*#')$%1*"#$%'##0A0+')*-"%#'"%8$%+*"$'(
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
F$
! !"#$%&'()*'++,%-(.$($.%/(-0&1%-2%)'131%'".%
&'()*)*-"1%-2%.')'%'($%*.$")*2*$.4%'"'+,5$%)6$%
!"#"$%&"'()*$)6')%-##0(1
! 7')'%16'(*"/%#'"%8$%#')$/-(*5$.%'19
! :$'.;-"+,
! <22$#)*=$+,%>-#'+
! :$'.;?(*)$
! @##0A0+')$
! B0+)*&+$%:$'.CD*"/+$%?(*)$
+,"!-.)/0
! 7')'%*1%($'.4%80)%"-)%E(*))$"
! F-%#-"1*1)$"#,%&(-8+$A1
! :$&+*#')*-"%*"%.*1)(*80)$.%1,1)$A
122,3#(4,/0-5.3"/
! 7')'%*1%($'.%'".%E(*))$"
! 7')'%*1%&'()*)*-"$.%*")-%1081$)1
! !"$%)'13%&$(%1081$)
! G'"%.*1)(*80)$%1081$)1
+,"!-6'(#,
! 7')'%*1%($'.%'".%E(*))$"
! B'",%)'131%'##$11%A'",%.')'
! G-"1*1)$"#,%*110$1
! B-1)%.*22*#0+)%)-%.$'+%E*)6
+,"!-6'(#,$!733898/"#(.)%
! @1%&$(%:$'.;?(*)$4%'+)6-0/6%E(*)$1%#-"1*1)%-2%'"%
'##0A0+')*-"%-&$(')*-"
! G-AA-"%*"%($.0#)*-";),&$%'+/-(*)6A1
! G'"%($&+*#')$%1*"#$%'##0A0+')*-"%#'"%8$%+*"$'(
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
FF
!"#$%&'()"*!+,-)(.-"*!"#$/0(12-"*&'()"
! !"#$%&#'%()*+&,-%.#(/-01230#14/5#14%#-("6#7&,-%"
! '%/(8%)#914","-%495#914"-&(,4-"
! :8(;$/%<#=(-&,8#=2/-,$/,9(-,14
3 4
! :8(;$/%<#=(-&,8#=2/-,$/,9(-,14
3 4
'%()*>4/5
'%()*>4/5
:??%9-,@%/5*
A19(/
! :8(;$/%<#=1/%92/(&#B54(;,9"
C$)(-%#D1",-,14"#(4)#E%/19,-,%"
F14#G14)%)#H1&9%"
F%,30I1&#A,"-
G14)%)#H1&9%"
H1&9%"
! :8(;$/%<#=1/%92/(&#B54(;,9"
C$)(-%#D1",-,14"#(4)#E%/19,-,%"
F14#G14)%)#H1&9%"
F%,30I1&#A,"-
G14)%)#H1&9%"
!-1;,9#
J11&),4(-%"
slide by Matthew Bolithosee Mattson et al “Patterns for Parallel Programming“ (2004)
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
FF
!"#$%&'()"*!+,-)(.-"*!"#$/0(12-"*&'()"
! !"#$%&#'%()*+&,-%.#(/-01230#14/5#14%#-("6#7&,-%"
! '%/(8%)#914","-%495#914"-&(,4-"
! :8(;$/%<#=(-&,8#=2/-,$/,9(-,14
3 4
! :8(;$/%<#=(-&,8#=2/-,$/,9(-,14
3 4
'%()*>4/5
'%()*>4/5
:??%9-,@%/5*
A19(/
! :8(;$/%<#=1/%92/(&#B54(;,9"
C$)(-%#D1",-,14"#(4)#E%/19,-,%"
F14#G14)%)#H1&9%"
F%,30I1&#A,"-
G14)%)#H1&9%"
H1&9%"
! :8(;$/%<#=1/%92/(&#B54(;,9"
C$)(-%#D1",-,14"#(4)#E%/19,-,%"
F14#G14)%)#H1&9%"
F%,30I1&#A,"-
G14)%)#H1&9%"
!-1;,9#
J11&),4(-%"
slide by Matthew Bolitho
!"#$$%&$'()*+,-.(/$$0
1+2*3+24(56(738892:(;<=,89<>(?<9-@(A<*B,-@(C-,D2+@,86
/E#E/$$0
FF
!"#$%&'()"*!+,-)(.-"*!"#$/0(12-"*&'()"
! !"#$%&#'%()*+&,-%.#(/-01230#14/5#14%#-("6#7&,-%"
! '%/(8%)#914","-%495#914"-&(,4-"
! :8(;$/%<#=(-&,8#=2/-,$/,9(-,14
3 4
! :8(;$/%<#=(-&,8#=2/-,$/,9(-,14
3 4
'%()*>4/5
'%()*>4/5
:??%9-,@%/5*
A19(/
! :8(;$/%<#=1/%92/(&#B54(;,9"
C$)(-%#D1",-,14"#(4)#E%/19,-,%"
F14#G14)%)#H1&9%"
F%,30I1&#A,"-
G14)%)#H1&9%"
H1&9%"
! :8(;$/%<#=1/%92/(&#B54(;,9"
C$)(-%#D1",-,14"#(4)#E%/19,-,%"
F14#G14)%)#H1&9%"
F%,30I1&#A,"-
G14)%)#H1&9%"
!-1;,9#
J11&),4(-%"
slide by Matthew Bolitho
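As a concrete instance of the decompositions above, a NumPy sketch (illustrative, not from the slides) of block-wise matrix multiplication: each block of C depends only on a block-row of A and a block-column of B, so the blocks are independent tasks.

import numpy as np

n, b = 8, 4                      # matrix size and block size (illustrative)
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = np.zeros((n, n))

# Each (i, j) block of C can be computed independently (one task per block).
for i in range(0, n, b):
    for j in range(0, n, b):
        C[i:i+b, j:j+b] = A[i:i+b, :] @ B[:, j:j+b]

assert np.allclose(C, A @ B)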
Useful patterns (for reference)
Embarrassingly Parallel

yi = fi(xi)   where i ∈ {1, . . . , N}.

Notation (also for the rest of this lecture):
• xi : inputs
• yi : outputs
• fi : (pure) functions (i.e. no side effects)

When does a function have a “side effect”?
In addition to producing a value, it
• modifies non-local state, or
• has an observable interaction with the outside world.

Often f1 = · · · = fN. Then:
• Lisp/Python function map
• C++ STL std::transform

slide from Berger & Klöckner (NYU 2010)
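A minimal sketch of the pattern with Python’s map; f is an arbitrary pure function, and a multiprocessing pool can evaluate the independent calls in parallel (names are illustrative).

from multiprocessing import Pool

def f(x):                      # pure function: no side effects
    return x * x

xs = list(range(16))

ys_serial = list(map(f, xs))   # the pattern itself: y_i = f(x_i)

if __name__ == "__main__":     # the same map, spread over worker processes
    with Pool(4) as pool:
        ys_parallel = pool.map(f, xs)
        assert ys_parallel == ys_serial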
Embarrassingly Parallel: Graph Representation

[Graph: independent nodes f0 … f8, each mapping its input xi to its output yi, with no edges between them]

Trivial? Often: no.

slide from Berger & Klöckner (NYU 2010)
Embarrassingly Parallel: Examples

Surprisingly useful:
• Element-wise linear algebra: addition, scalar multiplication (not inner product)
• Image Processing: shift, rotate, clip, scale, . . .
• Monte Carlo simulation
• (Brute-force) Optimization
• Random Number Generation
• Encryption, Compression (after blocking)
• Software compilation (make -j8)

But: Still needs a minimum of coordination. How can that be achieved?

slide from Berger & Klöckner (NYU 2010)
Mother-Child Parallelism

[Diagram: a Mother process sends initial data to children 0–4 and collects their results]

(formerly called “Master-Slave”)

slide from Berger & Klöckner (NYU 2010)
Embarrassingly Parallel: Issues

• Process Creation: Dynamic/Static?
  • MPI 2 supports dynamic process creation
• Job Assignment (“Scheduling”): Dynamic/Static?
• Operations/data light- or heavy-weight?
• Variable-size data?
• Load Balancing:
  • Here: easy
  Can you think of a load balancing recipe?

slide from Berger & Klöckner (NYU 2010)
Partition
yi = fi(xi−1, xi, xi+1)   where i ∈ {1, . . . , N}.

Includes straightforward generalizations to dependencies on a larger (but not O(P)-sized!) set of neighbor inputs.

slide from Berger & Klöckner (NYU 2010)
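A NumPy sketch of the pattern for a three-point stencil (the averaging function is just an illustration): interior updates are independent, and only values at partition boundaries would need to be exchanged between processors.

import numpy as np

def stencil_step(x):
    """y_i = f(x_{i-1}, x_i, x_{i+1}) on the interior; here f is an average."""
    y = x.copy()
    y[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
    return y

x = np.random.rand(16)
y = stencil_step(x)   # with P partitions, only boundary elements must be exchanged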
Partition: Graph
[Graph: outputs y1 … y5, each depending on its own input xi and its neighbors xi−1, xi+1]

slide from Berger & Klöckner (NYU 2010)
Partition: Examples
• Time-marching (in particular: PDE solvers)
  • (Including finite differences → HW3!)
• Iterative Methods
  • Solve Ax = b (Jacobi, . . . )
  • Optimization (all P on a single problem)
  • Eigenvalue solvers
• Cellular Automata (Game of Life :-)

slide from Berger & Klöckner (NYU 2010)
Partition: Issues
• Only useful when the computation is mainly local
• Responsibility for updating one datum rests with one processor
• Synchronization, Deadlock, Livelock, . . .
• Performance Impact
• Granularity
• Load Balancing: Thorny issue → next lecture
• Regularity of the Partition?

slide from Berger & Klöckner (NYU 2010)
Pipelined Computation
y = fN(· · · f2(f1(x)) · · · ) = (fN ◦ · · · ◦ f1)(x)   where N is fixed.

slide from Berger & Klöckner (NYU 2010)
Pipelined Computation: Graph
[Graph: x → f1 → f2 → f3 → · · · → fN → y]

Processor Assignment?

slide from Berger & Klöckner (NYU 2010)
Pipelined Computation: Examples
• Image processing
• Any multi-stage algorithm
• Pre/post-processing or I/O
• Out-of-Core algorithms

Specific simple examples:
• Sorting (insertion sort)
• Triangular linear system solve (“backsubstitution”)
• Key: Pass on values as soon as they’re available
(will see more efficient algorithms for both later)

slide from Berger & Klöckner (NYU 2010)
Pipelined Computation: Issues
• Non-optimal while the pipeline fills or empties
• Often communication-inefficient for large data
• Needs some attention to synchronization, deadlock avoidance
• Can accommodate some asynchrony
But don’t want:
• Pile-up
• Starvation

slide from Berger & Klöckner (NYU 2010)
Reduction
y = f(· · · f(f(x1, x2), x3), . . . , xN)   where N is the input size.

Also known as . . .
• Lisp/Python function reduce (Scheme: fold)
• C++ STL std::accumulate

slide from Berger & Klöckner (NYU 2010)
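The same pattern in Python, as a sketch: functools.reduce applies f left to right, exactly the chained form above.

from functools import reduce
import operator

xs = [3, 1, 7, 0, 4, 1, 6, 3]
total = reduce(operator.add, xs)        # f(...f(f(x1, x2), x3)..., xN)
print(total)                            # 25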
Reduction: Graph
[Graph: a left-leaning chain folding x1 … x6 into y one element at a time]

Painful! Not parallelizable.

slide from Berger & Klöckner (NYU 2010)
Approach to Reduction
f(x, y)?

Can we do better?
“Tree” very imbalanced. What property of f would allow ‘rebalancing’?

f(f(x, y), z) = f(x, f(y, z))

Looks less improbable if we let x ◦ y = f(x, y):

x ◦ (y ◦ z) = (x ◦ y) ◦ z

Has a very familiar name: Associativity

slide from Berger & Klöckner (NYU 2010)
Reduction: A Better Graph
[Graph: a balanced binary tree combining x0 … x7 pairwise down to y]

Processor allocation?

slide from Berger & Klöckner (NYU 2010)
Mapping Reduction to the GPU

• Obvious: Want to use the tree-based approach.
• Problem: Two scales, Work group and Grid
  • Need to occupy both to make good use of the machine.
  • In particular, need synchronization after each tree stage.
• Solution: Use a two-scale algorithm.

Solution: Kernel Decomposition

Avoid a global sync by decomposing the computation into multiple kernel invocations.
In the case of reductions, the code for all levels is the same: recursive kernel invocation.

[Figure: Level 0 reduces 8 blocks independently (e.g. 3 1 7 0 4 1 6 3 → 4 7 5 9 → 11 14 → 25); Level 1 reduces the 8 per-block results in 1 block]

In particular: Use multiple grid invocations to achieve inter-workgroup synchronization.

With material by M. Harris (Nvidia Corp.)
slide from Berger & Klöckner (NYU 2010)
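A NumPy sketch of the two-scale idea (host-side emulation, not actual kernel code): a first pass reduces each block to a partial sum, a second pass reduces the partial sums; on the GPU each pass would be one grid launch, which supplies the inter-workgroup synchronization.

import numpy as np

def two_level_sum(x, block=256):
    # Level 0: one partial sum per block (each block is independent work).
    n_pad = -(-len(x) // block) * block          # round length up to a multiple of block
    padded = np.zeros(n_pad)
    padded[:len(x)] = x
    partial = padded.reshape(-1, block).sum(axis=1)   # "first kernel launch"
    # Level 1: reduce the partial sums (second launch, or on the host).
    return partial.sum()

x = np.random.rand(10_000)
assert np.isclose(two_level_sum(x), x.sum())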
Interleaved Addressing

Parallel Reduction: Interleaved Addressing
[Figure: 16 values in shared memory reduced in place over four steps with strides 1, 2, 4, 8; the set of active thread IDs halves each step (0, 2, 4, …, 14 → 0, 4, 8, 12 → 0, 8 → 0)]

Issue: Slow modulo, Divergence

With material by M. Harris (Nvidia Corp.)
slide from Berger & Klöckner (NYU 2010)
Sequential Addressing

Parallel Reduction: Sequential Addressing
[Figure: the same 16 shared-memory values reduced over four steps with strides 8, 4, 2, 1; the active thread IDs are contiguous (0–7, then 0–3, 0–1, 0)]

Sequential addressing is conflict free.
Better! But still not “efficient”: only half of all work items are active after the first round, then a quarter, . . .

With material by M. Harris (Nvidia Corp.)
slide from Berger & Klöckner (NYU 2010)
Reduction: Examples
• Sum, Inner Product, Norm
  • Occurs in iterative methods
• Minimum, Maximum
• Data Analysis
  • Evaluation of Monte Carlo Simulations
• List Concatenation, Set Union
• Matrix-Vector product (but. . . )

slide from Berger & Klöckner (NYU 2010)
Reduction: Issues
• When adding: floating point cancellation?
• Serial order goes faster: can use registers for intermediate results
• Requires availability of a neutral element
• GPU-Reduce: Optimization sensitive to data type

slide from Berger & Klöckner (NYU 2010)
Map-Reduce
y = f(· · · f(f(g(x1), g(x2)), g(x3)), . . . , g(xN))   where N is the input size.

• Lisp naming, again
• Mild generalization of reduction

slide from Berger & Klöckner (NYU 2010)
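A one-line sketch of the composition in Python, with an illustrative g (map stage) and f (reduce stage):

from functools import reduce

xs = [1, 2, 3, 4, 5]
g = lambda x: x * x                   # map stage, applied to every element
f = lambda a, b: a + b                # reduce stage, combines the mapped values
y = reduce(f, map(g, xs))             # sum of squares = 55
print(y)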
Map-Reduce: Graph
[Graph: each input x0 … x7 first passes through g, then the g(xi) are combined pairwise in a tree down to y]

slide from Berger & Klöckner (NYU 2010)
MapReduce: Discussion
MapReduce ≥ map + reduce:
• Used by Google (and many others) for large-scale data processing
• Map generates (key, value) pairs
• Reduce operates only on pairs with identical keys
• Remaining output sorted by key
• Represent all data as character strings
  • User must convert to/from internal representation
• Messy implementation
  • Parallelization, fault tolerance, monitoring, data management, load balance, re-running “stragglers”, data locality
• Works for Internet-size data
• Simple to use even for inexperienced users

slide from Berger & Klöckner (NYU 2010)
MapReduce: Examples
• String search
• (e.g. URL) Hit count from Log
• Reverse web-link graph
• desired: (target URL, sources)
• Sort
• Indexing
• desired: (word, document IDs)
• Machine Learning, Clustering, . . .
slide from Berger & Klöckner (NYU 2010)
Scan
y1 = x1
y2 = f(y1, x2)
 ⋮
yN = f(yN−1, xN)

where N is the input size.

• Also called “prefix sum”.
• Or cumulative sum (“cumsum”) in Matlab/NumPy.

slide from Berger & Klöckner (NYU 2010)
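In NumPy terms the pattern is just a cumulative sum; a short sketch of the sequential definition next to the library call:

import numpy as np

x = np.array([3, 1, 7, 0, 4, 1, 6, 3])

# Sequential definition: y1 = x1, y_i = f(y_{i-1}, x_i) with f = +
y = np.empty_like(x)
y[0] = x[0]
for i in range(1, len(x)):
    y[i] = y[i - 1] + x[i]

assert np.array_equal(y, np.cumsum(x))   # inclusive scan / prefix sum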
Scan: Graph
[Graph: a chain in which each yi is computed from yi−1 and xi, then passed on unchanged (Id) to the next stage]

This can’t possibly be parallelized.
Or can it?

slide from Berger & Klöckner (NYU 2010)
Scan: Graph (continued)

Again: Need assumptions on f: associativity, commutativity.

slide from Berger & Klöckner (NYU 2010)
Scan: Implementation
Work-efficient?
slide from Berger & Klöckner (NYU 2010)
Scan: Implementation II
Two sweeps: Upward, downward, both tree-shaped.

On the upward sweep:
• Get values L and R from the left and right child
• Save L in a local variable Mine
• Compute Tmp = L + R and pass it to the parent

On the downward sweep:
• Get value Tmp from the parent
• Send Tmp to the left child
• Send Tmp + Mine to the right child

Work-efficient?
Span relative to the first attempt?

slide from Berger & Klöckner (NYU 2010)
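A Python sketch of the same work-efficient scan in its standard array form (up-sweep then down-sweep over a power-of-two input), producing an exclusive prefix sum; a GPU version would parallelize each tree level.

def exclusive_scan(a):
    """Work-efficient exclusive prefix sum; len(a) must be a power of two."""
    x = list(a)
    n = len(x)
    # Up-sweep: build partial sums up the tree.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            x[i + 2 * d - 1] += x[i + d - 1]
        d *= 2
    # Down-sweep: push prefixes back down the tree.
    x[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            t = x[i + d - 1]
            x[i + d - 1] = x[i + 2 * d - 1]
            x[i + 2 * d - 1] += t
        d //= 2
    return x

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))   # [0, 3, 4, 11, 11, 15, 16, 22]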
Scan: Examples
• Anything with a loop-carried dependence
• One row of Gauss-Seidel
• One row of triangular solve
• Segment numbering if boundaries are known
• Low-level building block for many higher-level algorithms
• FIR/IIR Filtering
• G. E. Blelloch: Prefix Sums and their Applications

slide from Berger & Klöckner (NYU 2010)
Scan: Issues
• Subtlety: Inclusive/Exclusive Scan
• Pattern sometimes hard to recognize
  • But shows up surprisingly often
• Need to prove associativity/commutativity
• Useful in implementation: algorithm cascading
  • Do a sequential scan on parts, then parallelize at coarser granularities

slide from Berger & Klöckner (NYU 2010)
Divide and Conquer
yi = fi(x1, . . . , xN)   for i ∈ {1, . . . , M}.

Main purpose: A way of partitioning up fully dependent tasks.

[Graph: the inputs x0 … x7 are recursively split, processed through intermediate values u, v, w, and recombined into the outputs y0 … y7]

Processor allocation?

slide from Berger & Klöckner (NYU 2010)
Divide and Conquer: Examples
• GEMM, TRMM, TRSM, GETRF (LU)
• FFT
• Sorting: Bucket sort, Merge sort
• N-Body problems (Barnes-Hut, FMM)
• Adaptive Integration

More fun with work and span: D&C analysis lecture

slide from Berger & Klöckner (NYU 2010)
Divide and Conquer: Issues
• “No idea how to parallelize that” → Try D&C
• Non-optimal during partition, merge
  • But: Does not matter if the deep levels do heavy enough processing
• Subtle to map to fixed-width machines (e.g. GPUs)
  • Varying data size along the tree
  • Bookkeeping nontrivial for non-2^n sizes
• Side benefit: D&C is generally cache-friendly

slide from Berger & Klöckner (NYU 2010)