Transcript of lecture slides (ECE 740, Fall 2013)
Computer Architecture: SIMD and GPUs (Part III)
(and briefly VLIW, DAE, Systolic Arrays)
Prof. Onur Mutlu, Carnegie Mellon University
A Note on This Lecture

- These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 20: GPUs, VLIW, DAE, Systolic Arrays
- Video of the part related only to SIMD and GPUs: http://www.youtube.com/watch?v=vr5hbSkb1Eg&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=20
Last Lecture

- SIMD Processing
- GPU Fundamentals
Today

- Wrap up GPUs
- VLIW
- If time permits:
  - Decoupled Access Execute
  - Systolic Arrays
  - Static Scheduling
Approaches to (Instruction-Level) Concurrency

- Pipelined execution
- Out-of-order execution
- Dataflow (at the ISA level)
- SIMD Processing
- VLIW
- Systolic Arrays
- Decoupled Access Execute
Graphics Processing Units: SIMD Not Exposed to Programmer (SIMT)
Review: High-Level View of a GPU
Review: Concept of “Thread Warps” and SIMT

- Warp: a set of threads that execute the same instruction (on different data elements) → SIMT (NVIDIA-speak)
- All threads run the same kernel
- Warp: the threads that run lengthwise in a woven fabric …

[Figure: thread warps 3, 7, and 8 queued for a SIMD pipeline; within a warp, scalar threads W, X, Y, Z share a common PC]
Review: Loop Iterations as Threads

for (i = 0; i < N; i++)
    C[i] = A[i] + B[i];

[Figure: scalar sequential code runs iteration 1's load, load, add, store, then iteration 2's, serially in time; vectorized code issues each of load, load, add, store once as a vector instruction covering both iterations]
Slide credit: Krste Asanovic
Review: SIMT Memory Access

- Same instruction in different threads uses the thread id to index and access different data elements

Let's assume N=16, blockDim=4 → 4 blocks

[Figure: elements 0–15 of two arrays added element-wise; each of the 4 blocks handles 4 consecutive indices]
Slide credit: Hyesoon Kim
Review: Sample GPU SIMT Code (Simplified)

CPU code:

for (ii = 0; ii < 100; ++ii) {
    C[ii] = A[ii] + B[ii];
}

CUDA code:

// there are 100 threads
__global__ void KernelFunction(…) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int varA = aa[tid];
    int varB = bb[tid];
    C[tid] = varA + varB;
}
Slide credit: Hyesoon Kim
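The index computation in the kernel above can be emulated on a CPU to see exactly which element each thread touches. A minimal C sketch (the function name, array names, and the block size of 25 are illustrative assumptions, not from the slides):

```c
#include <assert.h>

#define N 100
#define BLOCK_DIM 25   /* illustrative block size: 4 blocks cover 100 threads */

/* CPU-side emulation of the SIMT index computation
 *   tid = blockDim.x * blockIdx.x + threadIdx.x
 * Every (block, thread) pair handles exactly one array element. */
void kernel_emulated(const int *A, const int *B, int *C, int num_blocks) {
    for (int blockIdx_x = 0; blockIdx_x < num_blocks; ++blockIdx_x)
        for (int threadIdx_x = 0; threadIdx_x < BLOCK_DIM; ++threadIdx_x) {
            int tid = BLOCK_DIM * blockIdx_x + threadIdx_x;
            if (tid < N)               /* bounds guard, as real kernels use */
                C[tid] = A[tid] + B[tid];
        }
}
```

The two nested loops stand in for what the hardware does in parallel: each iteration of the inner body corresponds to one scalar thread of one block.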
Review: Sample GPU Program (Less Simplified)

[Slide shows a fuller CUDA program listing (image not transcribed)]
Slide credit: Hyesoon Kim
Review: Latency Hiding with “Thread Warps”

- Warp: a set of threads that execute the same instruction (on different data elements)
- Fine-grained multithreading
  - One instruction per thread in the pipeline at a time (no branch prediction)
  - Interleave warp execution to hide latencies
- Register values of all threads stay in the register file
- No OS context switching
- Memory latency hiding
  - Graphics has millions of pixels

[Figure: SIMD pipeline (I-Fetch, Decode, per-lane RF and ALU, D-Cache, Writeback); if a warp's accesses all hit, it stays among the warps available for scheduling, while a warp that misses waits among the warps accessing the memory hierarchy]
Slide credit: Tor Aamodt
Review: Warp-based SIMD vs. Traditional SIMD

- Traditional SIMD contains a single thread
  - Lock step
  - Programming model is SIMD (no threads) → SW needs to know the vector length
  - ISA contains vector/SIMD instructions
- Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner (i.e., the same instruction executed by all threads)
  - Does not have to be lock step
  - Each thread can be treated individually (i.e., placed in a different warp) → programming model not SIMD
    - SW does not need to know the vector length
    - Enables memory and branch latency tolerance
  - ISA is scalar → vector instructions formed dynamically
  - Essentially, it is an SPMD programming model implemented on SIMD hardware
Review: SPMD

- Single procedure/program, multiple data
  - This is a programming model rather than a computer organization
- Each processing element executes the same procedure, except on different data elements
  - Procedures can synchronize at certain points in the program, e.g. barriers
- Essentially, multiple instruction streams execute the same program
  - Each program/procedure can 1) execute a different control-flow path, 2) work on different data, at run time
  - Many scientific applications are programmed this way and run on MIMD computers (multiprocessors)
  - Modern GPUs are programmed in a similar way on a SIMD computer
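The SPMD idea — the same procedure on different data, with run-time divergence and explicit synchronization points — can be sketched with POSIX threads. This is an illustrative model (the thread count, chunk size, and the final reduction are assumptions), with `pthread_join` standing in for the barrier the slide mentions:

```c
#include <pthread.h>
#include <assert.h>

#define NUM_PE 4   /* illustrative: 4 "processing elements" (threads) */
#define CHUNK  8   /* data elements per PE */

static int data[NUM_PE * CHUNK];
static int partial[NUM_PE];

/* SPMD: every PE runs the SAME procedure; the PE id selects WHICH data
 * it works on. Control flow may still diverge per PE at run time. */
static void *same_procedure(void *arg) {
    long pe = (long)arg;                       /* processing-element id */
    int sum = 0;
    for (int i = pe * CHUNK; i < (pe + 1) * CHUNK; ++i)
        sum += data[i];
    partial[pe] = sum;
    return NULL;
}
```

Each thread runs an identical function body; only the `pe` argument differs, which is exactly the SPMD contract.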
Branch Divergence Problem in Warp-based SIMD

- SPMD execution on SIMD hardware
  - NVIDIA calls this “Single Instruction, Multiple Thread” (“SIMT”) execution

[Figure: a thread warp of 4 threads (threads 1–4) sharing a common PC while executing a control-flow graph with basic blocks A–G, where threads may take different paths]
Slide credit: Tor Aamodt
Control Flow Problem in GPUs/SIMD

- GPU uses a SIMD pipeline to save area on control logic
  - Groups scalar threads into warps
- Branch divergence occurs when threads inside warps branch to different execution paths

[Figure: at a branch, some threads of a warp take Path A and the others take Path B; the two paths execute serially, each with a partial active mask]
Slide credit: Tor Aamodt
Branch Divergence Handling (I)

[Figure: a per-warp reconvergence stack handles divergence for a CFG with blocks A–G. Each stack entry holds a reconvergence PC, a next PC, and an active mask; e.g., after the branch in B (mask 1111), entries (E, D, 0110) and (E, C, 1001) are pushed, the two paths execute serially with their partial masks, and the warp reconverges at E with mask 1111. Over time, A and B execute with mask 1111, C with 1001, D with 0110, then E and G with 1111 again]
Slide credit: Tor Aamodt
Branch Divergence Handling (II)

A;
if (some condition) {
    B;
} else {
    C;
}
D;

[Figure: one control-flow stack per warp. For the code above with 4 threads, the stack holds (Next PC, Reconv PC, Active Mask) entries: (A, --, 1111), (B, D, 1110), (C, D, 0001), (D, --, 1111). Execution sequence over time: A with mask 1111, B with 1110, C with 0001, then D with 1111]
Slide credit: Tor Aamodt
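The control-flow stack above can be modeled in a few lines of C. This is a simplified simulation (the block names, 4-bit masks, and push order follow the slide's example; real hardware pops an entry when the warp's PC reaches the stored reconvergence PC, whereas here each path ends exactly at the reconvergence point, so entries are simply popped):

```c
#include <assert.h>

/* One stack entry, mirroring the slide: next PC to execute,
 * reconvergence PC, and the active mask (1 bit per thread, 4 threads). */
typedef struct { char next; char reconv; unsigned mask; } Entry;

/* Simulates SIMT execution of:  A; if (cond) { B; } else { C; } D;
 * cond_mask holds the threads for which the condition is true.
 * Records the (block, mask) sequence that actually issues and
 * returns its length. */
int simulate(unsigned cond_mask, char blocks[], unsigned masks[]) {
    Entry stack[8];
    int top = 0, n = 0;
    stack[0] = (Entry){ 'A', 0, 0xF };          /* all 4 threads active */
    while (top >= 0) {
        Entry e = stack[top--];
        blocks[n] = e.next;                     /* "execute" the block */
        masks[n++] = e.mask;
        if (e.next == 'A') {                    /* the branch diverges here */
            stack[++top] = (Entry){ 'D', 0, 0xF };                /* reconverge */
            if (~cond_mask & 0xF)
                stack[++top] = (Entry){ 'C', 'D', ~cond_mask & 0xF }; /* else path */
            if (cond_mask)
                stack[++top] = (Entry){ 'B', 'D', cond_mask };        /* taken path */
        }
    }
    return n;
}
```

With the slide's condition outcome (true for threads 1–3, false for thread 4), the model reproduces the sequence A/1111, B/1110, C/0001, D/1111; if all threads agree, the untaken path is never pushed and no serialization occurs.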
Dynamic Warp Formation

- Idea: dynamically merge threads executing the same instruction (after branch divergence)
- Form new warps at divergence
  - Enough threads branching to each path to create full new warps
Dynamic Warp Formation/Merging

- Idea: dynamically merge threads executing the same instruction (after branch divergence)

[Figure: after a branch, threads from different warps on Path A are merged into a single warp instead of each warp executing the path with a partial mask]

- Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007.
Dynamic Warp Formation Example

[Figure: warps x and y both execute a CFG with blocks A–G; per-block active masks are A x/1111 y/1111, B x/1110 y/0011, C x/1000 y/0010, D x/0110 y/0001, E x/1110 y/0011, F x/0001 y/1100, G x/1111 y/1111. In the baseline, each warp serializes its divergent paths; with dynamic warp formation, a new warp is created from scalar threads of both warps x and y executing at basic block D, shortening total execution time]
Slide credit: Tor Aamodt
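The constraint behind the example — threads of warps x and y can share one new warp at block D only if their active threads occupy disjoint SIMD lanes — can be sketched as a small C helper (hypothetical, for illustration only):

```c
#include <assert.h>

/* Dynamic warp merging sketch: two warps whose remaining threads wait at
 * the same basic block can be fused into one warp only if their active
 * threads occupy disjoint SIMD lanes, because each thread's registers
 * live in a fixed lane of the register file and two threads cannot
 * share a lane in one warp. */
int try_merge(unsigned mask_x, unsigned mask_y, unsigned *merged) {
    if (mask_x & mask_y)        /* lane conflict: threads collide */
        return 0;
    *merged = mask_x | mask_y;  /* e.g., at D: x/0110 + y/0001 -> 0111 */
    return 1;
}
```

In the slide's example the masks at D are x/0110 and y/0001, which are disjoint, so the merge succeeds and the combined warp runs D with three of four lanes busy instead of two warps with two and one.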
What About Memory Divergence?

- Modern GPUs have caches
- Ideally: want all threads in the warp to hit (without conflicting with each other)
- Problem: one thread in a warp can stall the entire warp if it misses in the cache
- Need techniques to
  - Tolerate memory divergence
  - Integrate solutions to branch and memory divergence
NVIDIA GeForce GTX 285

- NVIDIA-speak:
  - 240 stream processors
  - “SIMT execution”
- Generic speak:
  - 30 cores
  - 8 SIMD functional units per core

Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285 “core”

[Figure: one core: an instruction-stream decoder feeding SIMD functional units (multiply-add, plus a multiply unit) with control shared across 8 units, execution context storage, and 64 KB of storage for fragment contexts (registers)]
Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285 “core”

[Figure: the same core, with 64 KB of storage for thread contexts (registers)]

- Groups of 32 threads share an instruction stream (each group is a warp)
- Up to 32 warps are simultaneously interleaved
- Up to 1024 thread contexts can be stored

Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285

[Figure: full chip: 30 cores with texture units (Tex) shared among groups of cores]

- 30 cores on the GTX 285: 30,720 threads
Slide credit: Kayvon Fatahalian
VLIW and DAE
Remember: SIMD/MIMD Classification of Computers

- Mike Flynn, “Very High Speed Computing Systems,” Proc. of the IEEE, 1966
- SISD: single instruction operates on a single data element
- SIMD: single instruction operates on multiple data elements
  - Array processor
  - Vector processor
- MISD? Multiple instructions operate on a single data element
  - Closest form: systolic array processor?
- MIMD: multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor
SISD Parallelism Extraction Techniques

- We have already seen
  - Superscalar execution
  - Out-of-order execution
- Are there simpler ways of extracting SISD parallelism?
  - VLIW (Very Long Instruction Word)
  - Decoupled Access/Execute
VLIW
VLIW (Very Long Instruction Word)

- A very long instruction word consists of multiple independent instructions packed together by the compiler
  - Packed instructions can be logically unrelated (contrast with SIMD)
- Idea: the compiler finds independent instructions and statically schedules (i.e., packs/bundles) them into a single VLIW instruction
- Traditional characteristics
  - Multiple functional units
  - Each instruction in a bundle executed in lock step
  - Instructions in a bundle statically aligned to be directly fed into the functional units
VLIW Concept

- Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983.
  - ELI: Enormously Longword Instructions (512 bits)
SIMD Array Processing vs. VLIW

- Array processor
VLIW Philosophy

- Philosophy similar to RISC (simple instructions and hardware)
  - Except multiple instructions in parallel
- RISC (John Cocke, 1970s, IBM 801 minicomputer)
  - Compiler does the hard work to translate high-level language code to simple instructions (John Cocke: control signals)
    - And to reorder simple instructions for high performance
  - Hardware does little translation/decoding → very simple
- VLIW (Fisher, ISCA 1983)
  - Compiler does the hard work to find instruction-level parallelism
  - Hardware stays as simple and streamlined as possible
    - Executes each instruction in a bundle in lock step
    - Simple → higher frequency, easier to design
VLIW Philosophy (II)

Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983.
Commercial VLIW Machines

- Multiflow TRACE, Josh Fisher (7-wide, 28-wide)
- Cydrome Cydra 5, Bob Rau
- Transmeta Crusoe: x86 binary-translated into internal VLIW
- TI C6000, Trimedia, STMicro (DSP & embedded processors)
  - Most successful commercially
- Intel IA-64
  - Not fully VLIW, but based on VLIW principles
  - EPIC (Explicitly Parallel Instruction Computing)
  - Instruction bundles can have dependent instructions
  - A few bits in the instruction format specify explicitly which instructions in the bundle are dependent on which other ones
VLIW Tradeoffs

- Advantages
  + No need for dynamic scheduling hardware → simple hardware
  + No need for dependency checking within a VLIW instruction → simple hardware for multiple instruction issue + no renaming
  + No need for instruction alignment/distribution after fetch to different functional units → simple hardware
- Disadvantages
  -- Compiler needs to find N independent operations
    -- If it cannot, it inserts NOPs in a VLIW instruction
    -- Parallelism loss AND code size increase
  -- Recompilation required when execution width (N), instruction latencies, or functional units change (unlike superscalar processing)
  -- Lockstep execution causes independent operations to stall
    -- No instruction can progress until the longest-latency instruction completes
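The NOP-insertion disadvantage can be made concrete with a toy bundler. The sketch below is a greedy in-order packer for a hypothetical 3-wide machine (the register-number instruction encoding and conflict rules are illustrative assumptions, far simpler than a real VLIW compiler's dependence analysis):

```c
#include <assert.h>

#define WIDTH 3           /* hypothetical 3-wide VLIW machine */
#define NOP   (-1)

typedef struct { int dst, src1, src2; } Instr;   /* register numbers */

/* Greedy in-order bundler: place each instruction in the current bundle
 * unless it has a RAW or WAW dependence on a bundle-mate; otherwise
 * start a new bundle. Unused slots stay NOPs, illustrating the
 * parallelism-loss and code-size cost when independence runs out. */
int bundle(const Instr *prog, int n, int sched[][WIDTH], int max_bundles) {
    int nb = 0, slot = 0;
    for (int b = 0; b < max_bundles; ++b)
        for (int s = 0; s < WIDTH; ++s) sched[b][s] = NOP;
    for (int i = 0; i < n; ++i) {
        int conflict = 0;
        for (int s = 0; s < slot; ++s) {
            int j = sched[nb][s];
            if (prog[j].dst == prog[i].src1 || prog[j].dst == prog[i].src2 ||
                prog[j].dst == prog[i].dst)
                conflict = 1;          /* depends on a bundle-mate */
        }
        if (slot == WIDTH || conflict) { ++nb; slot = 0; }
        sched[nb][slot++] = i;
    }
    return nb + 1;                     /* number of bundles emitted */
}
```

For a 4-instruction program where the third instruction consumes the results of the first two, the packer emits two bundles with one NOP slot each: a 3-wide machine issues only 4 useful operations in 6 slots.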
VLIW Summary

- VLIW simplifies hardware, but requires complex compiler techniques
- The solely-compiler approach of VLIW has several downsides that reduce performance
  -- Too many NOPs (not enough parallelism discovered)
  -- Static schedule intimately tied to microarchitecture
    -- Code optimized for one generation performs poorly for the next
  -- No tolerance for variable or long-latency operations (lock step)
  ++ Most compiler optimizations developed for VLIW are employed in optimizing compilers (for superscalar compilation)
    - Enable code optimizations
  ++ VLIW successful in embedded markets, e.g. DSP
DAE
Decoupled Access/Execute

- Motivation: Tomasulo's algorithm too complex to implement
  - 1980s, before HPS and the Pentium Pro
- Idea: decouple operand access and execution via two separate instruction streams that communicate via ISA-visible queues
- Smith, “Decoupled Access/Execute Computer Architectures,” ISCA 1982, ACM TOCS 1984.
Decoupled Access/Execute (II)

- Compiler generates two instruction streams (A and E)
  - Synchronizes the two upon control flow instructions (using branch queues)
Decoupled Access/Execute (III)

- Advantages:
  + Execute stream can run ahead of the access stream and vice versa
    + If A takes a cache miss, E can perform useful work
    + If A hits in cache, it supplies data to the lagging E
  + Queues reduce the number of required registers
  + Limited out-of-order execution without wakeup/select complexity
- Disadvantages:
  -- Compiler support to partition the program and manage queues
    -- Determines the amount of decoupling
  -- Branch instructions require synchronization between A and E
  -- Multiple instruction streams (can be done with a single one, though)
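The A/E split can be modeled with explicit FIFO queues in C. A rough sketch (the queue size and function structure are assumptions; a real machine interleaves the two streams dynamically rather than running each to completion):

```c
#include <assert.h>

#define QSIZE 16   /* sketch assumption: queues never exceed 16 entries */

/* ISA-visible FIFO connecting the Access (A) and Execute (E) streams. */
typedef struct { int buf[QSIZE]; int head, tail; } Queue;
static void push(Queue *q, int v) { q->buf[q->tail++] = v; }
static int  pop (Queue *q)        { return q->buf[q->head++]; }

/* A stream, loads: performs memory accesses and forwards operands to E.
 * Because it only touches memory, it can slide ahead of E (or lag it). */
void access_loads(const int *a, const int *b, int n, Queue *ae) {
    for (int i = 0; i < n; ++i) { push(ae, a[i]); push(ae, b[i]); }
}

/* E stream: consumes operands from the A->E queue, computes, and queues
 * results on the E->A queue instead of naming architectural registers. */
void execute_stream(int n, Queue *ae, Queue *ea) {
    for (int i = 0; i < n; ++i) { int x = pop(ae), y = pop(ae); push(ea, x + y); }
}

/* A stream, stores: drains the E->A queue into memory. */
void access_stores(int *c, int n, Queue *ea) {
    for (int i = 0; i < n; ++i) c[i] = pop(ea);
}
```

The queues are what make the decoupling work: E never names the registers A loaded into, so A can run many loads ahead of E without any renaming hardware.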
Astronautics ZS-1

- Single stream steered into A and X pipelines
- Each pipeline in-order
- Smith et al., “The ZS-1 central processor,” ASPLOS 1987.
- Smith, “Dynamic Instruction Scheduling and the Astronautics ZS-1,” IEEE Computer 1989.
Astronautics ZS-1 Instruction Scheduling

- Dynamic scheduling
  - A and X streams are issued/executed independently
  - Loads can bypass stores in the memory unit (if no conflict)
  - Branches executed early in the pipeline
    - To reduce the synchronization penalty of the A/X streams
    - Works only if the register a branch sources is available
- Static scheduling
  - Move compare instructions as early as possible before a branch
    - So that the branch source register is available when the branch is decoded
  - Reorder code to expose parallelism in each stream
  - Loop unrolling:
    - Reduces branch count + exposes code reordering opportunities
Loop Unrolling

- Idea: replicate the loop body multiple times within an iteration
  + Reduces loop maintenance overhead
    - Induction variable increment or loop condition test
  + Enlarges basic block (and analysis scope)
    - Enables code optimization and scheduling opportunities
  -- What if the iteration count is not a multiple of the unroll factor? (need extra code to detect this)
  -- Increases code size
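A C version of the idea, unrolled by a factor of 4, including the extra remainder loop the slide warns about (the function and unroll factor are illustrative choices):

```c
#include <assert.h>

/* Sum with an unroll factor of 4: one induction-variable update and one
 * loop test per FOUR element additions, which shrinks loop-maintenance
 * overhead and enlarges the basic block for the scheduler. The second
 * loop handles iteration counts that are not a multiple of 4. */
int sum_unrolled(const int *a, int n) {
    int s = 0, i = 0;
    for (; i + 4 <= n; i += 4)              /* unrolled body */
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; ++i)                      /* remainder: n % 4 iterations */
        s += a[i];
    return s;
}
```

The remainder loop and the duplicated body are exactly the two disadvantages listed above: extra code to detect the leftover iterations, and roughly 4x the loop-body code size.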
Systolic Arrays
Why Systolic Architectures?

- Idea: data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
- Similar to an assembly line
  - Different people work on the same car
  - Many cars are assembled simultaneously
  - Can be two-dimensional
- Why? Special-purpose accelerators/architectures need
  - Simple, regular designs (keep # unique parts small and regular)
  - High concurrency → high performance
  - Balanced computation and I/O (memory access)
Systolic Architectures

- H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982.

[Figure: circulatory-system analogy — memory is the heart, PEs are the cells; memory pulses data through the cells]
Systolic Architectures

- Basic principle: replace a single PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs → achieve high throughput without increasing memory bandwidth requirements
- Differences from pipelining:
  - Array structure can be non-linear and multi-dimensional
  - PE connections can be multidirectional (and of different speeds)
  - PEs can have local memory and execute kernels (rather than a piece of the instruction)
Systolic Computation Example

- Convolution
  - Used in filtering, pattern matching, correlation, polynomial evaluation, etc.
  - Many image processing tasks
Systolic Computation Example: Convolution

- y1 = w1x1 + w2x2 + w3x3
- y2 = w1x2 + w2x3 + w3x4
- y3 = w1x3 + w2x4 + w3x5
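The three equations above can be produced by a sketch of a 3-PE systolic array in which each weight stays resident in one PE and every input element streams past all of them, so each x is fetched from memory only once. The per-cycle double loop abstracts the PE timing; this is an illustrative model, not Kung's exact design:

```c
#include <assert.h>

#define K 3            /* number of PEs = number of weights */
#define MAX_Y 64       /* sketch assumption: at most 64 outputs in flight */

/* Computes y[i] = w[0]*x[i] + w[1]*x[i+1] + w[2]*x[i+2], the 0-indexed
 * form of y1 = w1x1 + w2x2 + w3x3 from the slide. */
void systolic_conv(const int *w, const int *x, int nx, int *y) {
    int ny = nx - K + 1;
    int psum[MAX_Y] = {0};            /* partial sums moving through the array */
    for (int t = 0; t < nx; ++t)      /* "cycle" t: x[t] enters the array */
        for (int k = 0; k < K; ++k) { /* PE k holds resident weight w[k] */
            int i = t - k;            /* x[t] meets the partial sum of y[t-k] */
            if (i >= 0 && i < ny)
                psum[i] += w[k] * x[t];
        }
    for (int i = 0; i < ny; ++i)
        y[i] = psum[i];
}
```

Each x[t] contributes to K different outputs as it passes the K PEs, which is the "multiple uses of each data item" property that lets a systolic array raise throughput without raising memory bandwidth.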
![Page 53: Computer Architecture: SIMD and GPUs (Part III)ece740/f13/lib/exe/...Computer Architecture: SIMD and GPUs (Part III) (and briefly VLIW, DAE, Systolic Arrays) Prof. Onur Mutlu Carnegie](https://reader035.fdocuments.net/reader035/viewer/2022070720/5ee1023bad6a402d666c0c1c/html5/thumbnails/53.jpg)
Systolic Computation Example: Convolution
- Worthwhile to implement the adder and multiplier separately, so that add and multiply executions can overlap
More Programmability

- Each PE in a systolic array
  - Can store multiple “weights”
  - Weights can be selected on the fly
  - Eases implementation of, e.g., adaptive filtering
- Taken further
  - Each PE can have its own data and instruction memory
  - Data memory → to store partial/temporary results and constants
  - Leads to stream processing, pipeline parallelism
- More generally, staged execution
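As a toy illustration of the selectable-weights idea (hypothetical code, not from the lecture): a PE that stores several preloaded weight sets and picks one per operand via a control tag travelling with the data, so the array can switch filters without reloading weights.

```python
class PE:
    """Toy processing element with several preloaded weight sets.

    A control tag travelling alongside the data selects which weight
    to apply on each step, which is what makes on-the-fly weight
    selection (and hence adaptive filtering) cheap to implement.
    """

    def __init__(self, weights):
        self.weights = weights          # e.g. {"smooth": 0.25, "edge": -1.0}

    def step(self, y_in, x, tag):
        # One multiply-accumulate using the weight named by `tag`.
        return y_in + self.weights[tag] * x
```

A usage example: `PE({"smooth": 2, "edge": -1}).step(1, 3, "smooth")` accumulates with weight 2, while the same PE with tag `"edge"` uses weight −1 on the very next cycle.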
Pipeline Parallelism
File Compression Example
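The slide's figure is not transcribed; a minimal sketch of the idea, assuming a read → compress → write split of the compression job (stage names and structure are my own, not from the slide). Each stage runs in its own thread and hands blocks to the next stage through a bounded queue, so stage i works on block k while stage i+1 works on block k−1: pipeline parallelism across blocks.

```python
import queue
import threading
import zlib

def pipeline_compress(blocks):
    """Three-stage pipeline: produce -> compress -> collect.

    Bounded queues connect the stages; a None sentinel marks
    end-of-stream. Output order is preserved because each stage is a
    single thread draining a FIFO queue.
    """
    q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    out = []

    def produce():                       # stage 1: feed raw blocks in
        for b in blocks:
            q1.put(b)
        q1.put(None)

    def compress():                      # stage 2: compress each block
        while (b := q1.get()) is not None:
            q2.put(zlib.compress(b))
        q2.put(None)

    def collect():                       # stage 3: gather results
        while (c := q2.get()) is not None:
            out.append(c)

    threads = [threading.Thread(target=f) for f in (produce, compress, collect)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

The bounded queues play the role of the inter-PE links in a systolic array: they throttle a fast stage instead of letting it run arbitrarily far ahead.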
Systolic Array

- Advantages
  - Makes multiple uses of each data item → reduced need for fetching/refetching
  - High concurrency
  - Regular design (in both data and control flow)
- Disadvantages
  - Not good at exploiting irregular parallelism
  - Relatively special purpose → needs software and programmer support to be a general-purpose model
The WARP Computer

- H. T. Kung, CMU, 1984–1988
- Linear array of 10 cells, each cell a 10 MFLOPS programmable processor
- Attached to a general-purpose host machine
- HLL and optimizing compiler to program the systolic array
- Used extensively to accelerate vision and robotics tasks
- Annaratone et al., “Warp Architecture and Implementation,” ISCA 1986.
- Annaratone et al., “The Warp Computer: Architecture, Implementation, and Performance,” IEEE TC 1987.
Systolic Arrays vs. SIMD

- Food for thought…
Some More Recommended Readings

- Fisher, “Very Long Instruction Word Architectures and the ELI-512,” ISCA 1983.
- Huck et al., “Introducing the IA-64 Architecture,” IEEE Micro, 2000.
- Russell, “The CRAY-1 Computer System,” CACM, 1978.
- Rau and Fisher, “Instruction-Level Parallel Processing: History, Overview, and Perspective,” Journal of Supercomputing, 1993.
- Faraboschi et al., “Instruction Scheduling for Instruction Level Parallel Processors,” Proceedings of the IEEE, Nov. 2001.