GPGPU
Ing. Martino Ruggiero (martino.ruggiero@unibo.it), Ing. Andrea Marongiu (a.marongiu@unibo.it)
![Page 2: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/2.jpg)
Old and New Wisdom in Computer Architecture
• Old: Power is free, transistors are expensive
• New: “Power wall”: power is expensive, transistors are free
(Can put more transistors on a chip than we can afford to turn on)
• Old: Multiplies are slow, memory access is fast
• New: “Memory wall”: multiplies are fast, memory is slow
(200 clocks to DRAM memory, 4 clocks for FP multiply)
• Old: Increasing instruction-level parallelism (ILP) via compilers and innovation (out-of-order, speculation, VLIW, …)
• New: “ILP wall”: diminishing returns on more ILP hardware
(Explicit thread and data parallelism must be exploited)
• New: Power Wall + Memory Wall + ILP Wall = Brick Wall
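To put the memory-wall figures in perspective, here is a rough back-of-the-envelope sketch. It uses only the two latencies quoted on this slide; the function and constant names are illustrative.

```python
# Latencies quoted on the slide, in clock cycles.
DRAM_ACCESS_CLOCKS = 200
FP_MULTIPLY_CLOCKS = 4


def multiplies_per_dram_access(dram_clocks: int, mul_clocks: int) -> float:
    """Number of back-to-back FP multiplies that fit in one DRAM access."""
    return dram_clocks / mul_clocks


ratio = multiplies_per_dram_access(DRAM_ACCESS_CLOCKS, FP_MULTIPLY_CLOCKS)
print(f"One DRAM access costs as much as {ratio:.0f} FP multiplies")
```

Under these numbers, a core that waits on every load idles for the equivalent of dozens of multiplies, which is why the later slides focus on hiding memory latency rather than on faster arithmetic.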
![Page 3: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/3.jpg)
Uniprocessor Performance (SPECint)
![Page 4: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/4.jpg)
SW Performance: 1993-2008
![Page 5: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/5.jpg)
Instruction-Stream Based Processing
![Page 6: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/6.jpg)
Data-Stream-Based Processing
![Page 7: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/7.jpg)
Instruction- and Data-Streams
![Page 8: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/8.jpg)
Architectures: Data–Processor Locality
• Field Programmable Gate Array (FPGA)
– Compute by configuring Boolean functions and local memory
• Processor Array / Multi-core Processor
– Assemble many (simple) processors and memories on one chip
• Processor-in-Memory (PIM)
– Insert processing elements directly into RAM chips
• Stream Processor
– Create data locality through a hierarchy of memories
• Graphics Processing Unit (GPU)
– Hide data access latencies by keeping 1000s of threads in-flight
GPUs often excel in the performance/price ratio
![Page 9: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/9.jpg)
Graphics Processing Unit (GPU)
• Development driven by the multi-billion dollar game industry
– Bigger than Hollywood
• Need for physics, AI and complex lighting models
• Impressive FLOPS / dollar performance
– Hardware has to be affordable
• Evolution speed surpasses Moore’s law
– Performance doubling approximately every 6 months
![Page 10: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/10.jpg)
What is GPGPU?
• The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor
– Programmability
– Precision
– Power
• GPGPU: an emerging field seeking to harness GPUs for general-purpose computation other than 3D graphics
– GPU accelerates critical path of application
• Data parallel algorithms leverage GPU attributes
– Large data arrays, streaming throughput
– Fine-grain SIMD parallelism
– Low-latency floating point (FP) computation
• Applications (see GPGPU.org)
– Game effects (FX) physics, image processing
– Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
![Page 11: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/11.jpg)
Motivation 1:
• Computational Power
– GPUs are fast…
– GPUs are getting faster, faster
![Page 12: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/12.jpg)
Motivation 2:
• Flexible, Precise and Cheap:
– Modern GPUs are deeply programmable
  • Solidifying high-level language support
– Modern GPUs support high precision
  • 32-bit floating point throughout the pipeline
  • High enough for many (not all) applications
![Page 13: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/13.jpg)
Parallel Computing on a GPU
• NVIDIA GPU Computing Architecture
– Via a separate HW interface
– In laptops, desktops, workstations, servers
• 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
• GPU parallelism is doubling every year
• Programming model scales transparently
• Programmable in C with CUDA tools
• Multithreaded SPMD model uses application data parallelism and thread parallelism
[Pictured: GeForce 8800, Tesla S870, Tesla D870]
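The SPMD idea (every thread runs the same program on a different piece of data) can be sketched in plain Python. This is a sequential stand-in for illustration only, not CUDA code: `saxpy_kernel`, `launch`, and the block/thread parameters are hypothetical names chosen to mirror the usual grid-of-blocks vocabulary.

```python
def saxpy_kernel(block_idx, block_dim, thread_idx, a, x, y, out):
    """Body run by every logical thread: same program, different data."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(x):                          # guard the partially filled last block
        out[i] = a * x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for a parallel launch of grid_dim * block_dim threads."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements
launch(saxpy_kernel, grid_dim, block_dim, 2.0, x, y, out)
print(out)
```

Because the kernel only depends on its own indices, the same code covers a larger array simply by launching more blocks, which is one way to read "scales transparently".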
![Page 14: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/14.jpg)
Towards GPGPU
• The previous 3D GPU
– A fixed-function graphics pipeline
• The modern 3D GPU
– A programmable parallel processor
• NVIDIA’s Tesla and Fermi architectures
– Unify the vertex and pixel processors
![Page 15: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/15.jpg)
![Page 16: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/16.jpg)
![Page 17: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/17.jpg)
![Page 18: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/18.jpg)
![Page 19: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/19.jpg)
![Page 20: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/20.jpg)
![Page 21: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/21.jpg)
![Page 22: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/22.jpg)
![Page 23: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/23.jpg)
![Page 24: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/24.jpg)
![Page 25: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/25.jpg)
![Page 26: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/26.jpg)
The evolution of the pipeline
Elements of the graphics pipeline:
1. A scene description: vertices, triangles, colors, lighting
2. Transformations that map the scene to a camera viewpoint
3. “Effects”: texturing, shadow mapping, lighting calculations
4. Rasterizing: converting geometry into pixels
5. Pixel processing: depth tests, stencil tests, and other per-pixel operations
Parameters controlling design of the pipeline:
1. Where is the boundary between CPU and GPU?
2. What transfer method is used?
3. What resources are provided at each step?
4. What units can access which GPU memory elements?
![Page 27: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/27.jpg)
Generation I: 3dfx Voodoo (1996)
• One of the first true 3D game cards
• Worked by supplementing a standard 2D video card
• Did not do vertex transformations: these were done on the CPU
• Did do texture mapping and z-buffering
[Figure: pipeline diagram with the stages Vertex Transforms, Primitive Assembly, Rasterization and Interpolation, Raster Operations, and Frame Buffer; CPU and GPU connected via PCI]
![Page 28: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/28.jpg)
Generation II: GeForce/Radeon 7500 (1998)
• Main innovation: shifting the transformation and lighting calculations to the GPU
• Allowed multi-texturing, enabling bump maps, light maps, and others
• Faster AGP bus instead of PCI
[Figure: pipeline diagram with the stages Vertex Transforms, Primitive Assembly, Rasterization and Interpolation, Raster Operations, and Frame Buffer; GPU attached via AGP]
![Page 29: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/29.jpg)
Generation III: GeForce3/Radeon 8500 (2001)
• For the first time, allowed a limited amount of programmability in the vertex pipeline
• Also allowed volume texturing and multi-sampling (for antialiasing)
[Figure: pipeline diagram with the stages Vertex Transforms (with small vertex shaders), Primitive Assembly, Rasterization and Interpolation, Raster Operations, and Frame Buffer; GPU attached via AGP]
![Page 30: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/30.jpg)
Generation IV: Radeon 9700/GeForce FX (2002)
• This is the first generation of fully programmable graphics cards
• Different versions have different resource limits on fragment/vertex programs
[Figure: pipeline diagram with the stages Vertex Transforms (programmable vertex shader), Primitive Assembly, Rasterization and Interpolation, Programmable Fragment Processor, Raster Operations, and Frame Buffer; GPU attached via AGP]
![Page 31: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/31.jpg)
[Figure: programmable graphics pipeline. 3D Application or Game → 3D API (OpenGL or Direct3D) → CPU-GPU boundary (AGP/PCIe) → GPU Front End → Programmable Vertex Processor → Primitive Assembly → Rasterization and Interpolation → Programmable Fragment Processor → Raster Operations → Frame Buffer. Data labels between stages: 3D API commands, GPU command and data stream, vertex index stream, pre-transformed vertices, transformed vertices, assembled primitives, pixel location stream, pre-transformed fragments, transformed fragments, pixel updates]
• Vertex processors
– Operate on the vertices of primitives (points, lines, and triangles)
– Typical operations: transforming coordinates; setting up lighting and texture parameters
• Pixel processors
– Operate on rasterizer output
– Typical operations: filling the interior of primitives
![Page 32: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/32.jpg)
The road to unification
• Vertex and pixel processors have evolved at different rates
• Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one.
• However, typical workloads are not well balanced, leading to inefficiency.
– For example, with large triangles, the vertex processors are mostly idle, while the pixel processors are fully busy. With small triangles, the opposite is true.
• The addition of more-complex primitive processing makes it much harder to select a fixed processor ratio.
• Increased generality increased the design complexity, area, and cost of developing two separate processors
• All these factors influenced the decision to design a unified architecture:
– to execute vertex and pixel-fragment shader programs on the same unified processor architecture.
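The load-balancing argument can be illustrated with a small idealized model. Everything here is an assumption made for illustration: work is measured in abstract units, it is perfectly divisible across processors, and the 4 + 12 processor counts are just one instance of the roughly one-to-three ratio mentioned above.

```python
def split_time(vertex_work, pixel_work, n_vertex, n_pixel):
    """Frame time with dedicated units: the slower pool sets the pace."""
    return max(vertex_work / n_vertex, pixel_work / n_pixel)


def unified_time(vertex_work, pixel_work, n_total):
    """Frame time when any processor can run either kind of work."""
    return (vertex_work + pixel_work) / n_total


def utilization(total_work, n_total, time):
    """Fraction of processor-time spent doing useful work."""
    return total_work / (n_total * time)


N_VERTEX, N_PIXEL = 4, 12  # hypothetical 1:3 split of 16 processors
workloads = {
    "large triangles": (10, 990),   # few vertices, many pixels
    "small triangles": (600, 400),  # many vertices, fewer pixels
}
for name, (v, p) in workloads.items():
    t_split = split_time(v, p, N_VERTEX, N_PIXEL)
    t_unified = unified_time(v, p, N_VERTEX + N_PIXEL)
    u_split = utilization(v + p, N_VERTEX + N_PIXEL, t_split)
    print(f"{name}: split {t_split:.1f}, unified {t_unified:.1f}, "
          f"split utilization {u_split:.0%}")
```

In this model the unified pool is never slower than the fixed split, and the gap widens as the vertex/pixel mix moves away from the ratio the hardware was built for.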
![Page 33: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/33.jpg)
Previous GPGPU Constraints
![Page 34: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/34.jpg)
What’s wrong with GPGPU?
![Page 35: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/35.jpg)
From pixel/fragment to thread program…
![Page 36: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/36.jpg)
CPU-“style” cores
![Page 37: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/37.jpg)
Slimming down
![Page 38: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/38.jpg)
Two cores
![Page 39: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/39.jpg)
Four cores
![Page 40: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/40.jpg)
Sixteen cores
![Page 41: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/41.jpg)
Add ALUs
![Page 42: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/42.jpg)
128 elements in parallel
![Page 43: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/43.jpg)
But what about branches?
![Page 44: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/44.jpg)
But what about branches?
![Page 45: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/45.jpg)
But what about branches?
![Page 46: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/46.jpg)
But what about branches?
![Page 47: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/47.jpg)
Clarification
SIMD processing does not imply SIMD instructions
• Option 1: Explicit vector instructions
– Intel/AMD x86 SSE, Intel Larrabee
• Option 2: Scalar instructions, implicit HW vectorization
– HW determines instruction stream sharing across ALUs (amount of sharing hidden from software)
– NVIDIA GeForce (“SIMT” warps), ATI Radeon architectures
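A rough way to see what a shared instruction stream means for branches is to simulate one group of lanes executing a scalar `if/else` under a per-lane mask. This is a simplified sketch, not a description of any particular GPU: `run_warp` is a hypothetical helper, and the 4-lane inputs stand in for a real warp.

```python
def run_warp(xs):
    """Run `y = x * 2 if x > 0 else -x` for all lanes that share one instruction stream.

    Returns the per-lane results and the number of code paths the group had to
    step through (1 if all lanes agree, 2 if they diverge).
    """
    mask = [x > 0 for x in xs]      # per-lane branch outcome
    ys = [None] * len(xs)
    passes = 0
    if any(mask):                   # 'then' path: lanes with the mask set are active
        passes += 1
        for lane, x in enumerate(xs):
            if mask[lane]:
                ys[lane] = x * 2
    if not all(mask):               # 'else' path: the remaining lanes are active
        passes += 1
        for lane, x in enumerate(xs):
            if not mask[lane]:
                ys[lane] = -x
    return ys, passes


print(run_warp([1, 2, 3, 4]))    # all lanes take the same path
print(run_warp([1, -2, 3, -4]))  # lanes diverge, both paths are executed
```

Each lane's program is scalar, yet a divergent group pays for both sides of the branch while part of the lanes sit idle on each pass.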
![Page 48: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/48.jpg)
Stalls!
• Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
• Memory access latency = 100s to 1000s of cycles
• We’ve removed the fancy caches and logic that help avoid stalls.
• But we have LOTS of independent work items.
• Idea #3: Interleave processing of many elements on a single core to avoid stalls caused by high-latency operations.
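A simple throughput model shows how much interleaving is needed. It assumes, purely for illustration, that every group of work items alternates between a fixed stretch of ALU work and a fixed-length stall, and that switching between groups is free; the cycle counts below are hypothetical values inside the range quoted on the slide.

```python
import math


def core_utilization(n_groups, compute_cycles, stall_cycles):
    """Fraction of cycles the core is busy when n_groups are interleaved."""
    busy = n_groups * compute_cycles
    period = compute_cycles + stall_cycles  # one group's compute + stall cycle
    return min(1.0, busy / period)


def groups_to_hide_stall(compute_cycles, stall_cycles):
    """Smallest number of interleaved groups that keeps the core always busy."""
    return math.ceil((compute_cycles + stall_cycles) / compute_cycles)


COMPUTE, STALL = 20, 200  # hypothetical: 20 cycles of math per 200-cycle memory stall
needed = groups_to_hide_stall(COMPUTE, STALL)
print(f"1 group:  {core_utilization(1, COMPUTE, STALL):.0%} busy")
print(f"{needed} groups: {core_utilization(needed, COMPUTE, STALL):.0%} busy")
```

The longer the stall relative to the compute stretch, the more independent groups the core must hold in flight, which is why GPUs keep so much per-thread state on chip.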
![Page 49: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/49.jpg)
Hiding stalls
![Page 50: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/50.jpg)
Hiding stalls
![Page 51: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/51.jpg)
Hiding stalls
![Page 52: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/52.jpg)
Hiding stalls
![Page 53: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/53.jpg)
Hiding stalls
![Page 54: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/54.jpg)
Throughput!
![Page 55: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/55.jpg)
Summary: Three key ideas
• Use many “slimmed down” cores to run in parallel
• Pack cores full of ALUs (by sharing an instruction stream across groups of work items)
• Avoid latency stalls by interleaving execution of many groups of work items / threads / ...
– When one group stalls, work on another group
![Page 56: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/56.jpg)
Global memory
![Page 57: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/57.jpg)
Parallel data cache
![Page 58: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/58.jpg)
NVIDIA Tesla
![Page 59: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/59.jpg)
CUDA Device Memory Space Overview
• Each thread can:
– R/W per-thread registers
– R/W per-thread local memory
– R/W per-block shared memory
– R/W per-grid global memory
– Read only per-grid constant memory
– Read only per-grid texture memory
• The host can R/W global, constant, and texture memories
[Figure: a (device) grid containing Block (0, 0) and Block (1, 0); each block has its own shared memory and threads (0, 0) and (1, 0), each thread with its own registers and local memory; global, constant, and texture memory belong to the grid and connect to the host]
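The access rules on this slide can be written down as a small lookup table. This is only a restatement of the list above as data; the dictionary layout and helper names are illustrative and not part of any CUDA API.

```python
# memory space -> (scope, thread access, host access), as listed on the slide.
MEMORY_SPACES = {
    "registers": ("thread", "rw", None),
    "local":     ("thread", "rw", None),
    "shared":    ("block",  "rw", None),
    "global":    ("grid",   "rw", "rw"),
    "constant":  ("grid",   "r",  "rw"),
    "texture":   ("grid",   "r",  "rw"),
}


def thread_can_write(space):
    """True if device threads may write to this memory space."""
    return MEMORY_SPACES[space][1] == "rw"


def host_can_access(space):
    """True if the host can read and write this memory space."""
    return MEMORY_SPACES[space][2] is not None


for space, (scope, thread_access, host_access) in MEMORY_SPACES.items():
    print(f"{space:9s} scope={scope:6s} thread={thread_access:2s} host={host_access}")
```

Reading the table row by row makes the pattern visible: the spaces the host can reach are exactly the per-grid ones, and two of those are read-only from the device side.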
![Page 60: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/60.jpg)
Global, Constant, and Texture Memories (Long Latency Accesses)
• Global memory
– Main means of communicating R/W data between host and device
– Contents visible to all threads
• Texture and Constant Memories
– Constants initialized by host
– Contents visible to all threads
[Figure: a (device) grid with Block (0, 0) and Block (1, 0), each holding shared memory and threads with registers and local memory; global, constant, and texture memory connect the grid to the host]
![Page 61: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/61.jpg)
Memory Hierarchy
• CPU and GPU Memory Hierarchy
[Figure: memory hierarchy diagram with the levels Disk, CPU Main Memory, GPU Video Memory, CPU Caches, GPU Caches, CPU Registers, GPU Temporary Registers, and GPU Constant Registers; one end of the scale is labeled “Slow”]
![Page 62: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/62.jpg)
NVIDIA’s Fermi Generation CUDA Compute Architecture:
The key architectural highlights of Fermi are:
• Third Generation Streaming Multiprocessor (SM)
– 32 CUDA cores per SM, 4x over GT200
– 8x the peak double precision floating point performance over GT200
• Second Generation Parallel Thread Execution ISA
– Unified Address Space with Full C++ Support
– Optimized for OpenCL and DirectCompute
• Improved Memory Subsystem
– NVIDIA Parallel DataCache hierarchy with Configurable L1 and Unified L2 Caches
– Improved atomic memory op performance
• NVIDIA GigaThread™ Engine
– 10x faster application context switching
– Concurrent kernel execution
– Out of Order thread block execution
– Dual overlapped memory transfer engines
![Page 63: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/63.jpg)
Third Generation Streaming Multiprocessor
• 512 High Performance CUDA cores
– Each SM features 32 CUDA processors
– Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU)
• 16 Load/Store Units
– Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock.
– Supporting units load and store the data at each address to cache or DRAM.
• Four Special Function Units
– Special Function Units (SFUs) execute transcendental instructions such as sine, cosine, reciprocal, and square root.
![Page 64: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/64.jpg)
Dual Warp Scheduler
• The SM schedules threads in groups of 32 parallel threads called warps.
• Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently.
• Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs.
• Because warps execute independently, Fermi’s scheduler does not need to check for dependencies from within the instruction stream.
• Using this elegant model of dual-issue, Fermi achieves near peak hardware performance.
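The unit counts quoted for Fermi allow a quick calculation of how long one warp instruction occupies each kind of unit. The sketch assumes, as a simplification not stated on the slides, that each unit processes one thread per clock; the constant and function names are illustrative.

```python
WARP_SIZE = 32          # threads per warp
TOTAL_CUDA_CORES = 512  # per chip
CORES_PER_SM = 32

# Units that one warp instruction is issued to.
UNITS_PER_ISSUE = {"cuda_cores": 16, "load_store": 16, "sfu": 4}


def cycles_per_warp_instruction(unit):
    """Clocks needed to push all 32 threads of a warp through one unit group."""
    return WARP_SIZE // UNITS_PER_ISSUE[unit]


print("SMs per chip:", TOTAL_CUDA_CORES // CORES_PER_SM)
for unit in UNITS_PER_ISSUE:
    print(f"{unit}: {cycles_per_warp_instruction(unit)} clocks per warp instruction")
```

Under this assumption an instruction on the sixteen-wide groups ties them up for two clocks, while a transcendental instruction on the four SFUs takes four times as long, so the schedulers have room to issue other warps to the remaining units in the meantime.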
![Page 65: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/65.jpg)
Second Generation Parallel Thread Execution ISA
PTX is a low level virtual machine and ISA designed to support the operations of a parallel thread processor. At program install time, PTX instructions are translated to machine instructions by the GPU driver.
The primary goals of PTX are:
– Provide a stable ISA that spans multiple GPU generations
– Achieve full GPU performance in compiled applications
– Provide a machine-independent ISA for C, C++, Fortran, and other compiler targets.
– Provide a code distribution ISA for application and middleware developers
– Provide a common ISA for optimizing code generators and translators, which map PTX to specific target machines.
– Facilitate hand-coding of libraries and performance kernels
– Provide a scalable programming model that spans GPU sizes from a few cores to many parallel cores
![Page 66: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/66.jpg)
Fermi and the PTX 2.0 ISA address space
Three separate address spaces (thread private local, block shared, and global) for load and store operations.
• In PTX 1.0, load/store instructions were specific to one of the three address spaces:
– programs could load/store values in a specific target address space known at compile time;
– it was difficult to fully implement C/C++ pointers, since a pointer’s target address space may not be known at compile time.
• With PTX 2.0, all three address spaces are unified into a single, continuous address space.
• The 40-bit unified address space supports a terabyte of addressable memory, and the load/store ISA supports 64-bit addressing for future growth.
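A small sketch makes the two claims concrete: the size of a 40-bit address space, and how a single generic pointer could be resolved to one of the three spaces at run time. The window base addresses and sizes below are hypothetical values invented for this example; the slides do not specify the actual layout.

```python
ADDRESS_BITS = 40
ADDRESSABLE_BYTES = 2 ** ADDRESS_BITS  # 1 TiB
print(ADDRESSABLE_BYTES // 2 ** 40, "TiB addressable")

# Hypothetical windows carved out of the unified space for local and shared memory.
LOCAL_BASE, LOCAL_SIZE = 0x0100_0000, 0x0100_0000
SHARED_BASE, SHARED_SIZE = 0x0200_0000, 0x0100_0000


def resolve(addr):
    """Map a unified address to (address space, offset within that space)."""
    if not 0 <= addr < ADDRESSABLE_BYTES:
        raise ValueError("address outside the 40-bit unified space")
    if LOCAL_BASE <= addr < LOCAL_BASE + LOCAL_SIZE:
        return "local", addr - LOCAL_BASE
    if SHARED_BASE <= addr < SHARED_BASE + SHARED_SIZE:
        return "shared", addr - SHARED_BASE
    return "global", addr  # everything outside the windows is global memory


print(resolve(0x0100_0010))
print(resolve(0x0200_0004))
print(resolve(0x1000_0000))
```

With a scheme of this kind the space is decided from the address value itself, so a C/C++ pointer no longer needs its target space to be known at compile time.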
![Page 67: GPGPU Ing. Martino Ruggiero Ing. Andrea Marongiu martino.ruggiero@unibo.itmartino.ruggiero@unibo.it a.marongiu@unibo.ita.marongiu@unibo.it.](https://reader035.fdocuments.net/reader035/viewer/2022081401/56649e445503460f94b380bc/html5/thumbnails/67.jpg)
Summary Table