GPGPUs - Data Parallel Accelerators
Dezső Sima
Oct. 20, 2009
Dezső Sima 2009, Ver. 1.0
Contents
1. Introduction
2. Basics of the SIMT execution
3. Overview of GPGPUs
4. Overview of data parallel accelerators
5. Microarchitecture of GPGPUs (examples)
5.1 AMD/ATI RV870 (Cypress)
5.2 Nvidia Fermi
5.3 Intel's Larrabee
6. References
1. The emergence of GPGPUs
Representation of objects by triangles
Vertex, Edge, Surface
Vertices have
three spatial coordinates,
supplementary information necessary to render the object, such as color, texture, reflectance properties etc.
Main types of shaders in GPUs
Shaders
Vertex shaders: transform each vertex's 3D position in the virtual space to the 2D coordinate at which it appears on the screen.
Geometry shaders: can add or remove vertices from a mesh.
Pixel shaders (fragment shaders): calculate the color of the pixels.
DirectX version Pixel SM Vertex SM Supporting OS
8.0 (11/2000) 1.0, 1.1 1.0, 1.1 Windows 2000
8.1 (10/2001) 1.2, 1.3, 1.4 1.0, 1.1 Windows XP / Windows Server 2003
9.0 (12/2002) 2.0 2.0
9.0a (3/2003) 2_A, 2_B 2.x
9.0c (8/2004) 3.0 3.0 Windows XP SP2
10.0 (11/2006) 4.0 4.0 Windows Vista
10.1 (2/2008) 4.1 4.1 Windows Vista SP1/Windows Server 2008
11 (in development) 5.0 5.0
Table: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX and Microsoft's OSs [18], [21]
Convergence of important features of the vertex and pixel shader models
Subsequent shader models typically introduce a number of new/enhanced features.
Differences between the vertex and pixel shader models in subsequent shader models concern precision requirements, instruction sets and programming resources.
Shader model 2 [19]
Different precision requirements
Vertex shader: FP32 (coordinates)
Pixel shader: FX24 (3 colors x 8)
Different instructions
Different resources (e.g. registers)
Shader model 3 [19]
Unified precision requirements for both shaders (FP32),
with the option to specify partial precision (FP16 or FP24) by adding a modifier to the shader code
Different instructions
Different resources (e.g. registers)
Shader model 4 (introduced with DirectX 10) [20]
Unified precision requirements for both shaders (FP32), with the possibility to use new data formats
Unified instruction set
Unified resources (e.g. temporary and constant registers)
Shader architectures of GPUs prior to SM4
GPUs prior to SM4 (DirectX 10) have separate vertex and pixel units with different features.
Drawback of having separate units for vertex and pixel shading:
inefficiency of the hardware implementation
(vertex shaders and pixel shaders often have complementary load patterns [21]).
DirectX version Pixel SM Vertex SM Supporting OS
8.0 (11/2000) 1.0, 1.1 1.0, 1.1 Windows 2000
8.1 (10/2001) 1.2, 1.3, 1.4 1.0, 1.1 Windows XP / Windows Server 2003
9.0 (12/2002) 2.0 2.0
9.0a (3/2003) 2_A, 2_B 2.x
9.0c (8/2004) 3.0 3.0 Windows XP SP2
10.0 (11/2006) 4.0 4.0 Windows Vista
10.1 (2/2008) 4.1 4.1 Windows Vista SP1/Windows Server 2008
11 (in development) 5.0 5.0
Table: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX and Microsoft's OSs [18], [21]
Unified shader model (introduced in SM 4.0 of DirectX 10.0)
The same (programmable) processor can be used to implement all shaders:
the vertex shader,
the pixel shader and
the geometry shader (new feature of SM 4).
Unified, programmable shader architecture
Figure: Principle of the unified shader architecture [22]
Based on their FP32 computing capability and the large number of FP units available,
unified shaders are prospective candidates for speeding up HPC!
GPUs with unified shader architectures are also termed
GPGPUs (General Purpose GPUs)
or
cGPUs (computational GPUs).
Figure: Peak SP FP performance of Nvidia's GPUs vs. Intel's P4 and Core2 processors [11]
Figure: Bandwidth values of Nvidia's GPUs vs. Intel's P4 and Core2 processors [11]
Figure: Contrasting the utilization of the silicon area in CPUs and GPUs [11]
2. Basics of the SIMT execution
Main alternatives of data parallel execution

Data parallel execution

SIMD execution:
One-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors.
Needs an FX/FP SIMD extension of the ISA.
E.g. 2nd and 3rd generation superscalars.

SIMT execution:
Two-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (matrices).
Is massively multithreaded, and provides data dependent flow control as well as barrier synchronization.
Needs an FX/FP SIMT extension of the ISA and the API.
E.g. GPGPUs, data parallel accelerators.

Figure: Main alternatives of data parallel execution
Scalar, SIMD and SIMT execution
Scalar execution: domain of execution is single data elements.
SIMD execution: domain of execution is elements of vectors.
SIMT execution: domain of execution is elements of matrices (at the programming level).
Figure: Domains of execution in case of scalar, SIMD and SIMT execution
Remark
SIMT execution is also termed SPMD (Single Program Multiple Data) execution (Nvidia).
Key components of the implementation of SIMT execution
Data parallel execution
Massive multithreading
Data dependent flow control
Barrier synchronization
Data parallel execution
Performed by SIMT cores:
SIMT cores execute the same instruction stream on a number of ALUs
(i.e. all ALUs of a SIMT core typically perform the same operation).
SIMT core: a Fetch/Decode unit feeding a row of ALUs
Figure: Basic layout of a SIMT core
SIMT cores are the basic building blocks of GPGPUs or data parallel accelerators.
During SIMT execution, 2-dimensional matrices are mapped to blocks of SIMT cores.
Remark 1
Different manufacturers designate SIMT cores differently, such as
streaming multiprocessor (Nvidia),
superscalar shader processor (AMD),
wide SIMD processor, CPU core (Intel).
Each ALU is allocated a working register set (RF).
Figure: Main functional blocks of a SIMT core (Fetch/Decode unit; per-ALU register files RF feeding the ALUs)
SIMT ALUs typically perform RRR operations, that is,
ALUs take their operands from, and write the calculated results to, the register set (RF) allocated to them.
Figure: Principle of operation of the SIMT ALUs
Remark 2
Actually, the register sets (RF) allocated to each ALU are given parts of a large enough register file.
Figure: Allocation of distinct parts of a large register set as workspaces of the ALUs
Basic operation of recent SIMT ALUs
SIMT ALUs
are pipelined, capable of starting a new operation every new clock cycle (more precisely, every shader clock cycle),
execute basically SP FP-MADD (single precision, i.e. 32-bit, Multiply-Add) instructions of the form a×b+c,
need a few clock cycles, e.g. 2 or 4 shader cycles, to present the results of the SP FMADD operations to the RF.
That is, without further enhancements
their peak performance is 2 SP FP operations/cycle.
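As a worked check against the GT200 data given in the table of Section 3 (240 ALUs, 1.296 GHz shader clock), the MADD units alone yield

$240 \times 1.296\,\mathrm{GHz} \times 2\ \mathrm{ops/cycle} \approx 622\ \mathrm{GFLOPS}$

while the 933 GFLOPS peak quoted there additionally counts a dual-issued MUL, i.e. 3 FP32 operations/cycle: $240 \times 1.296 \times 3 \approx 933\ \mathrm{GFLOPS}$.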
Additional operations provided by SIMT ALUs
FX operations and FX/FP conversions,
DP FP operations,
trigonometric functions (usually supported by special functional units).
Massive multithreading
Aim of massive multithreading:
to speed up computations by increasing the utilization of available computing resources in case of stalls (e.g. due to cache misses).
Principle
Suspend stalled threads from execution and allocate ready-to-run threads for execution.
When a large enough number of threads are available, long stalls can be hidden.
Multithreading is implemented by creating and managing parallel executable threads for each data element of the execution domain.
Same instructions for all data elements
Figure: Parallel executable threads for each element of the execution domain
Effective implementation of multithreading:
thread switches, called context switches, do not cause cycle penalties.
Achieved by
providing separate contexts (register space) for each thread, and
implementing a zero-cycle context switch mechanism.
Figure: Providing separate thread contexts (CTX) for each thread allocated for execution in a SIMT ALU (the SIMT core's register file (RF) holds per-thread contexts; a context switch selects the actual context)
Data dependent flow control
Implemented by SIMT branch processing.
In SIMT processing both paths of a branch are executed one after another, such that
for each path the prescribed operations are executed only on those data elements which fulfill the data condition given for that path (e.g. x_i > 0).
Example
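A minimal CUDA-style sketch of such a data dependent branch (the array x and the two operation sequences are illustrative, not from the slides); the figures on the next slides show how the hardware serializes the two paths:

__global__ void branchExample(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per data element
    if (i < n) {
        if (x[i] > 0.0f)
            x[i] = 2.0f * x[i];   // executed first, only by threads meeting the condition
        else
            x[i] = -x[i];         // executed afterwards, by the remaining threads
    }
}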
Figure: Execution of branches [24]
The given condition will be checked separately for each thread
Figure: Execution of branches [24]
First all ALUs meeting the condition execute the prescribed three operations, then all ALUs missing the condition execute the next two operations.
Figure: Resuming instruction stream processing after executing a branch [24]
Barrier synchronization
All threads must complete all prior instructions before any thread executes the next instruction.
Remark
Implemented e.g. in AMD's Intermediate Language (IL) by the fence_threads instruction [10].
In the R600 ISA this instruction is encoded by setting the BARRIER field of the Control Flow (CF) instruction format [7].
Principle of SIMT execution
Each kernel invocation executes all thread blocks (Block(i,j)).
kernel0()
kernel1()
Host / Device
Figure: Hierarchy of threads [25]
3. Overview of GPGPUs
Basic implementation alternatives of the SIMT execution

GPGPUs: programmable GPUs with appropriate programming environments.
E.g. Nvidia's 8800 and GTX lines, AMD's HD 38xx, HD 48xx lines.
Have display outputs.

Data parallel accelerators: dedicated units supporting data parallel execution, with appropriate programming environment.
E.g. Nvidia's Tesla lines, AMD's FireStream lines.
No display outputs; have larger memories than GPGPUs.

Figure: Basic implementation alternatives of the SIMT execution
GPGPUs
Figure: Overview of Nvidia's and AMD/ATI's GPGPU lines
Nvidia's line: 90 nm: G80; 65 nm: G92 (shrink), G200 (enhanced arch.); 40 nm: Fermi (enhanced arch.)
AMD/ATI's line: 80 nm: R600; 55 nm: RV670 (shrink), RV770 (enhanced arch.); 40 nm: RV870 (shrink, enhanced arch.)
8800 GTS 8800 GTX 8800 GT GTX 260 GTX 280
Core G80 G80 G92 GT200 GT200
Introduction 11/06 11/06 10/07 6/08 6/08
IC technology 90 nm 90 nm 65 nm 65 nm 65 nm
Nr. of transistors 681 mtrs 681 mtrs 754 mtrs 1400 mtrs 1400 mtrs
Die area 480 mm2 480 mm2 324 mm2 576 mm2 576 mm2
Core frequency 500 MHz 575 MHz 600 MHz 576 MHz 602 MHz
Computation
No. of ALUs 96 128 112 192 240
Shader frequency 1.2 GHz 1.35 GHz 1.512 GHz 1.242 GHz 1.296 GHz
No. FP32 inst./cycle 3* 3* 3* 3 3 (*: but only in a few issue cases)
Peak FP32 performance 346 GFLOPS 512 GFLOPS 508 GFLOPS 715 GFLOPS 933 GFLOPS
Peak FP64 performance 77.76 GFLOPS
Memory
Mem. transfer rate (eff) 1600 Mb/s 1800 Mb/s 1800 Mb/s 1998 Mb/s 2214 Mb/s
Mem. interface 320-bit 384-bit 256-bit 448-bit 512-bit
Mem. bandwidth 64 GB/s 86.4 GB/s 57.6 GB/s 111.9 GB/s 141.7 GB/s
Mem. size 320 MB 768 MB 512 MB 896 MB 1.0 GB
Mem. type GDDR3 GDDR3 GDDR3 GDDR3 GDDR3
Mem. channel 6*64-bit 6*64-bit 4*64-bit 8*64-bit 8*64-bit
Mem. contr. Crossbar Crossbar Crossbar Crossbar Crossbar
System
Multi. CPU techn. SLI SLI SLI SLI SLI
Interface PCIe x16 PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16
MS Direct X 10 10 10 10.1 subset 10.1 subset
Table: Main features of Nvidia's GPGPUs
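The memory bandwidth rows follow from the effective transfer rate and the interface width; e.g. for the GTX 280:

$2214\ \mathrm{Mb/s} \times 512\ \mathrm{bit} / 8 \approx 141.7\ \mathrm{GB/s}$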
HD 2900XT HD 3850 HD 3870 HD 4850 HD 4870
Core R600 RV670 RV670 RV770 RV770
Introduction 5/07 11/07 11/07 5/08 5/08
IC technology 80 nm 55 nm 55 nm 55 nm 55 nm
Nr. of transistors 700 mtrs 666 mtrs 666 mtrs 956 mtrs 956 mtrs
Die area 408 mm2 192 mm2 192 mm2 260 mm2 260 mm2
Core frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz
Computation
No. of ALUs 320 320 320 800 800
Shader frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz
No. FP32 inst./cycle 2 2 2 2 2
Peak FP32 performance 471.6 GFLOPS 429 GFLOPS 496 GFLOPS 1000 GFLOPS 1200 GFLOPS
Peak FP64 performance 200 GFLOPS 240 GFLOPS
Memory
Mem. transfer rate (eff) 1600 Mb/s 1660 Mb/s 2250 Mb/s 2000 Mb/s 3600 Mb/s (GDDR5)
Mem. interface 512-bit 256-bit 256-bit 256-bit 256-bit
Mem. bandwidth 105.6 GB/s 53.1 GB/s 72.0 GB/s 64 GB/s 118 GB/s
Mem. size 512 MB 256 MB 512 MB 512 MB 512 MB
Mem. type GDDR3 GDDR3 GDDR4 GDDR3 GDDR3/GDDR5
Mem. channel 8*64-bit 8*32-bit 8*32-bit 4*64-bit 4*64-bit
Mem. contr. Ring bus Ring bus Ring bus Crossbar Crossbar
System
Multi. CPU techn. CrossFire CrossFire X CrossFire X CrossFire X CrossFire X
Interface PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16
MS Direct X 10 10.1 10.1 10.1 10.1
Table: Main features of AMD/ATI's GPGPUs
Price relations (as of 10/2008)
Nvidia
GTX260 ~ 300 $
GTX280 ~ 600 $
AMD/ATI
HD4850 ~ 200 $
HD4870 na
4. Overview of data parallel accelerators
Implementation alternatives of data parallel accelerators

Data parallel accelerators

On-card implementation (recent implementations):
E.g. GPU cards, data-parallel accelerator cards

On-die integration (future implementations):
E.g. Intel's Havendale, AMD's Torrenza integration technology, AMD's Fusion integration technology

Trend
Figure: Implementation alternatives of dedicated data parallel accelerators
On-card accelerators

Card implementations: single cards fitting into a free PCI-E x16 slot of the host computer.
E.g. Nvidia Tesla C870, Nvidia Tesla C1060, AMD FireStream 9170, AMD FireStream 9250

Desktop implementations: usually dual cards mounted into a box, connected to an adapter card that is inserted into a free PCI-E x16 slot of the host PC through a cable.
E.g. Nvidia Tesla D870

1U server implementations: usually 4 cards mounted into a 1U server rack, connected to two adapter cards that are inserted into two free PCI-E x16 slots of a server through two switches and two cables.
E.g. Nvidia Tesla S870, Nvidia Tesla S1070

Figure: Implementation alternatives of on-card accelerators
Figure: Main functional units of Nvidia's Tesla C870 card [2]
FB: Frame Buffer
Figure: Nvidia's Tesla C870 and AMD's FireStream 9170 cards [2], [3]
Figure: Tesla D870 desktop implementation [4]
Figure: Nvidia's Tesla D870 desktop implementation [4]
Figure: PCI-E x16 host adapter card of Nvidia's Tesla D870 desktop [4]
Figure: Concept of Nvidia's Tesla S870 1U rack server [5]
Figure: Internal layout of Nvidia's Tesla S870 1U rack [6]
Figure: Connection cable between Nvidia's Tesla S870 1U rack and the adapter cards inserted into PCI-E x16 slots of the host server [6]
Nvidia Tesla family

Card:
C870: 6/07, G80-based, 1.5 GB GDDR3, 0.519 TFLOPS
C1060: 6/08, GT200-based, 4 GB GDDR3, 0.936 TFLOPS

Desktop:
D870: 6/07, G80-based, 2 x C870 incl., 3 GB GDDR3, 1.037 TFLOPS

1U Server:
S870: 6/07, G80-based, 4 x C870 incl., 6 GB GDDR3, 2.074 TFLOPS
S1070: 6/08, GT200-based, 4 x C1060 incl., 16 GB GDDR3, 3.744 TFLOPS

CUDA:
Version 1.0: 6/07
Version 1.01: 11/07
Version 2.0: 6/08

Figure: Overview of Nvidia's Tesla family
AMD FireStream family

Card:
9170: 11/07 (shipped 6/08), RV670-based, 2 GB GDDR3, 500 GFLOPS FP32, ~200 GFLOPS FP64
9250: 6/08 (shipped 10/08), RV770-based, 1 GB GDDR3, 1 TFLOPS FP32, ~300 GFLOPS FP64

Stream Computing SDK:
Version 1.0: 12/07
Brook+, ACML (AMD Core Math Library), CAL (Compute Abstraction Layer)
Rapid Mind

Figure: Overview of AMD/ATI's FireStream family
Nvidia Tesla cards AMD FireStream cards
Core type C870 C1060 9170 9250
Based on G80 GT200 RV670 RV770
Introduction 6/07 6/08 11/07 6/08
Core
Core frequency 600 MHz 602 MHz 800 MHz 625 MHz
ALU frequency 1350 MHz 1296 MHz 800 MHz 625 MHz
No. of ALUs 128 240 320 800
Peak FP32 performance 518 GFLOPS 933 GFLOPS 512 GFLOPS 1 TFLOPS
Peak FP64 performance ~200 GFLOPS ~250 GFLOPS
Memory
Mem. transfer rate (eff) 1600 Mb/s 1600 Mb/s 1600 Mb/s 1986 Mb/s
Mem. interface 384-bit 512-bit 256-bit 256-bit
Mem. bandwidth 76.8 GB/s 102 GB/s 51.2 GB/s 63.5 GB/s
Mem. size 1.5 GB 4 GB 2 GB 1 GB
Mem. type GDDR3 GDDR3 GDDR3 GDDR3
System
Interface PCI-E x16 PCI-E 2.0x16 PCI-E 2.0x16 PCI-E 2.0x16
Power (max) 171 W 200 W 150 W 150 W
Table: Main features of Nvidia's and AMD/ATI's data parallel accelerator cards
Price relations (as of 10/2008)
Nvidia Tesla
C870 ~ 1500 $
D870 ~ 5000 $
S870 ~ 7500 $
C1060 ~ 1600 $
S1070 ~ 8000 $
AMD/ATI FireStream
9170 ~ 800 $ 9250 ~ 800 $
5. Microarchitecture of GPGPUs (examples)
5.1 AMD/ATI RV870 (Cypress)
5.2 Nvidia Fermi
5.3 Intel's Larrabee
5.1 AMD/ATI RV870
AMD/ATI RV870 (Cypress): Radeon 5870 graphics card
Introduction: Sept. 22, 2009; availability: now
OpenCL 1.0 compliant
Performance figures:
SP FP performance: 2.72 TFLOPS
DP FP performance: 544 GFLOPS (1/5 of the SP FP performance)
Radeon series/5800

ATI Radeon HD 4870 | ATI Radeon HD 5850 | ATI Radeon HD 5870
Manufacturing Process 55-nm 40-nm 40-nm
# of Transistors 956 million 2.15 billion 2.15 billion
Core Clock Speed 750 MHz 725 MHz 850 MHz
# of Stream Processors 800 1440 1600
Compute Performance 1.2 TFLOPS 2.09 TFLOPS 2.72 TFLOPS
Memory Type GDDR5 GDDR5 GDDR5
Memory Clock 900 MHz 1000 MHz 1200 MHz
Memory Data Rate 3.6 Gbps 4.0 Gbps 4.8 Gbps
Memory Bandwidth 115.2 GB/sec 128 GB/sec 153.6 GB/sec
Max Board Power 160 W 170 W 188 W
Idle Board Power 90 W 27 W 27 W
Figure: Radeon Series/5800 [42]
Architecture overview
20 cores
16 ALUs/core
5 EUs/ALU
1600 EUs (stream processing units) in total
8x32 = 256-bit GDDR5 memory interface, 153.6 GB/s
Figure: Architecture overview [42]
The 5870 card
Figure: The 5870 card [41]
5.2 Nvidia Fermi
Nvidia's Fermi
Introduced: Sept. 30, 2009, at Nvidia's GPU Technology Conference
Available: 1Q 2010
Fermi's overall structure
16 cores (streaming multiprocessors)
Each core: 32 ALUs
6 x dual channel GDDR5 (384-bit)
Figure: Fermi's overall structure [40]
Layout of a core (SM)
1 SM includes 32 ALUs (called CUDA cores by Nvidia).
Figure: Layout of a core [40]
A single ALU (CUDA core)
SP FP: 32-bit; FX: 32-bit
DP FP: needs 2 clock cycles, i.e. DP FP performance is 1/2 of the SP FP performance!!
IEEE 754-2008 compliant
Figure: A single ALU [40]
Fermi's system architecture
Figure: Fermi's system architecture [39]
Contrasting Fermi and GT200
Figure: Contrasting Fermi and GT200 [39]
The execution of programs utilizing GPGPUs
Each kernel invocation executes a grid of thread blocks (Block(i,j)).
kernel0()
kernel1()
Host / Device
Figure: Hierarchy of threads [25]
Global scheduling in Fermi
Figure: Global scheduling in Fermi [39]
Microarchitecture of a Fermi core
Principle of operation of the G80/G92/Fermi GPGPUs
Principle of operation of the G80/G92 GPGPUs
The key point of operation is work scheduling:
Scheduling thread blocks for execution
Segmenting thread blocks into warps
Scheduling warps for execution
CUDA Thread Block
All threads in a block execute the same kernel program (SPMD).
Programmer declares the block:
block size: 1 to 512 concurrent threads,
block shape: 1D, 2D, or 3D,
block dimensions in threads.
Threads have thread id numbers within the block.
The thread program uses the thread id to select work and address shared data (see the sketch below).
Threads in the same block share data and synchronize while doing their share of the work.
Threads in different blocks cannot cooperate.
Each block can execute in any order relative to other blocks!
Thread Id #: 0 1 2 3 … m
Thread program
Courtesy: John Nickolls, NVIDIA
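A minimal sketch of this pattern (the kernel and array names are illustrative): each thread combines its thread id and block id into a global index and uses it to select its share of the work.

__global__ void scale2D(float *m, float factor, int width, int height)
{
    // 2D thread id within the block + 2D block id within the grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)        // edge blocks may be only partially filled
        m[y * width + x] *= factor;     // each thread updates its own element
}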
Thread scheduling in Nvidia's GPGPUs
Scheduling thread blocks for execution
Up to 8 blocks can be assigned to an SM for execution.
TPC: Thread Processing Cluster (Texture Processing Cluster)
A TPC has
2 SMs in the G80/G92,
3 SMs in the G200.
A device may run thread blocks sequentially or even in parallel, if it has enough resources for this, or usually by a combination of both.
Figure: Assigning thread blocks to streaming multiprocessors (SM) for execution [12] (blocks t0 t1 t2 … tm mapped onto SM0/SM1 of a TPC; each SM with MT IU, SPs and shared memory; texture L1, TF, L2, memory)
Segmenting thread blocks into warps
Threads are scheduled for execution in groups of 32 threads, called warps.
For scheduling, each thread block is subdivided into warps.
At any point of time up to 24 warps can be maintained by the scheduler.
Figure: Segmenting thread blocks into warps [12] (Block 1 warps, Block 2 warps: t0 t1 t2 … t31; streaming multiprocessor with instruction fetch/dispatch, instruction L1, data L1, shared memory, 8 SPs and 2 SFUs)
Remark
The number of threads constituting a warp is an implementation decision and not part of the CUDA programming model.
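To make the warp granularity concrete, a hedged sketch (the kernel name and output encoding are illustrative): CUDA exposes the warp width as the built-in variable warpSize (32 on these devices), so a 256-thread block is segmented into 8 warps.

__global__ void warpIds(int *out)
{
    int tid    = threadIdx.x;       // linear thread index within the block
    int warpId = tid / warpSize;    // which warp of the block this thread belongs to
    out[blockIdx.x * blockDim.x + tid] = warpId;
}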
Scheduling warps for execution
The warp scheduler is a zero-overhead scheduler:
Only those warps are eligible for execution whose next instruction has all operands available.
Eligible warps are scheduled coarse grained (not indicated in the figure) and priority based.
All threads in a warp execute the same instruction when selected.
4 clock cycles are needed to dispatch the same instruction to all threads in the warp (G80).
Figure: Scheduling warps for execution [12] (SM multithreaded warp scheduler issuing, over time: warp 8 instruction 11; warp 1 instruction 42; warp 3 instruction 95; warp 8 instruction 12; warp 3 instruction 96; …)
5.3 Intel's Larrabee
Larrabee
Part of Intel's Tera-Scale Initiative.
Brief history:
Project started ~2005
First unofficial public presentation: 03/2006 (withdrawn)
First brief public presentation: 09/07 (Otellini) [29]
First official public presentations: in 2008 (e.g. at SIGGRAPH [27])
Due in ~2009
Performance (targeted): 2 TFLOPS
Objectives:
Not a single product but a base architecture for a number of different products.
High end graphics processing, HPC
NI: New Instructions
Figure: Positioning of Larrabee in Intel's product portfolio [28]
Figure: First public presentation of Larrabee at IDF Fall 2007 [29]
Figure: Block diagram of Larrabee [30]
Basic architecture:
Cores: in-order x86 IA cores augmented with new instructions
L2 cache: fully coherent
Ring bus: 1024 bits wide
Figure: Block diagram of Larrabee's cores [31]
Larrabee microarchitecture [27]
Derived from that of the Pentium's in-order design.
Figure: The ancestor of Larrabee's cores [28]
Main extensions:
64-bit instructions
4-way multithreading (with 4 register sets)
addition of a 16-wide (16x32-bit) VU
increased L1 caches (32 KB vs 8 KB)
access to its 256 KB local subset of a coherent L2 cache
a ring network to access the coherent L2 $ and to allow interprocessor communication
New instructions allow explicit cache control, e.g.
to prefetch data into the L1 and L2 caches,
to control the eviction of cache lines by reducing their priority;
in this way the L2 cache can be used as a scratchpad memory while remaining fully coherent.
The Scalar Unit
supports the full ISA of the Pentium (it can run existing code including OS kernels and applications),
provides new instructions, e.g. for
bit count,
bit scan (finds the next set bit within a register).
The Vector Unit
Figure: Block diagram of the Vector Unit [31]
VU scatter-gather instructions:
load a VU vector register from 16 non-contiguous data locations from anywhere in the on-die L1 cache without penalty, or store a VU register similarly;
the L1 D$ thereby becomes, in effect, an extension of the register file.
Numeric conversions:
8-bit and 16-bit integer and 16-bit FP data can be read from or written into the L1 $, with conversion to 32-bit integers without penalty.
Mask registers:
have one bit per vector lane, to control which parts of a vector register or memory data are read or written and which remain untouched.
ALUs
The ALUs execute integer, SP FP and DP FP instructions.
Multiply-add instructions are available.
Figure: Layout of the 16-wide vector ALU [31]
Task scheduling
is performed entirely by software, rather than by hardware as in Nvidia's or AMD/ATI's GPGPUs.
SP FP performance
16 ALUs/core x 2 operations/cycle (multiply-add) = 32 operations/cycle per core.
At present no data are available for the clock frequency or the number of cores in Larrabee.
Assuming a clock frequency of 2 GHz and 32 cores:
SP FP performance: 2 TFLOPS
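Under these assumptions the estimate is simply

$16\ \mathrm{ALUs} \times 2\ \mathrm{ops/cycle} \times 32\ \mathrm{cores} \times 2\ \mathrm{GHz} = 2048\ \mathrm{GFLOPS} \approx 2\ \mathrm{TFLOPS}$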
Figure: Larrabee's software stack (Source: Intel)
Larrabee's native C/C++ compiler allows many available apps to be recompiled and run correctly with no modifications.
6. References
[1]: Torricelli F., AMD in HPC, HPC07, http://www.altairhyperworks.co.uk/html/en-GB/keynote2/Torricelli_AMD.pdf
[2]: NVIDIA Tesla C870 GPU Computing Board, Board Specification, Jan. 2008, Nvidia
[3]: AMD FireStream 9170, http://ati.amd.com/technology/streamcomputing/product_firestream_9170.html
[4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008,Nvidia,http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf
[5]: Tesla S870 GPU Computing System, Specification, Nvidia, http://jp.nvidia.com/docs/IO/43395/S870-BoardSpec_SP-03685-001_v00b.pdf
[6]: Torres G., Nvidia Tesla Technology, Nov. 2007,http://www.hardwaresecrets.com/article/495
[7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD
[8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU,ASPLOS 2006, June 2008
[9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007http://ati.amd.com/developer/techpapers.html
[10]: Compute Abstraction Layer (CAL) Technology Intermediate Language (IL), Version 2.0, Oct. 2008, AMD
[11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0,June 2008, Nvidia
[12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007, University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture7-threading%20hardware.ppt
[13]: Kogo H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008,http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf
[14]: Nvidia G80, Pc Watch, April 16 2007,http://pc.watch.impress.co.jp/docs/2007/0416/kaigai350.htm
[15]: GeForce 8800GT (G92), PC Watch, Oct. 31 2007,http://pc.watch.impress.co.jp/docs/2007/1031/kaigai398_07.pdf
[16]: NVIDIA GT200 and AMD RV770, PC Watch, July 2 2008, http://pc.watch.impress.co.jp/docs/2008/0702/kaigai451.htm
[17]: Shrout R., Nvidia GT200 Revealed GeForce GTX 280 and GTX 260 Review,
PC Perspective, June 16 2008,http://www.pcper.com/article.php?aid=577&type=expert&pid=3
[18]: http://en.wikipedia.org/wiki/DirectX
[19]: Dietrich S., Shader Model 3.0, April 2004, Nvidia,http://www.cs.umbc.edu/~olano/s2004c01/ch15.pdf
[20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006,
Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html
[21]: Patidar S. & al., Exploiting the Shader Model 4.0 Architecture, Center forVisual Information Technology, IIIT Hyderabad,http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf
[22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia,http://www.nvidia.com/page/8800_tech_briefs.html
[23]: Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch, http://pc.watch.impress.co.jp/docs/2008/0822/kaigai_06.pdf
[24]: Fatahalian K., From Shader Code to a Teraflop: How Shader Cores Work, Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008
[25]: Kanter D., NVIDIA's GT200: Inside a Parallel Processor, 09-08-2008
[26]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 1.1, Nov. 2007, Nvidia
[27]: Seiler L. & al., Larrabee: A Many-Core x86 Architecture for Visual Computing,ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008
[28]: Kogo H., Larrabee, PC Watch, Oct. 17, 2008, http://pc.watch.impress.co.jp/docs/2008/1017/kaigai472.htm
[29]: Shrout R., IDF Fall 2007 Keynote, Sept. 18, 2007, PC Perspective, http://www.pcper.com/article.php?aid=453
[30]: Stokes J., Larrabee: Intel's biggest leap ahead since the Pentium Pro, Aug. 04, 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-intels-biggest-leap-ahead-since-the-pentium-pro.html
[31]: Shimpi A. L. & Wilson D., Intel's Larrabee Architecture Disclosure: A Calculated First Move, Anandtech, Aug. 4, 2008, http://www.anandtech.com/showdoc.aspx?i=3367&p=2
[32]: Hester P., Multi-Core and Beyond: Evolving the x86 Architecture, Hot Chips 19, Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf
[33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1
http://ati.amd.com/technology/streamcomputing/Stream_Computing_User_Guide.pdf
[34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf. Aug. 2007,http://www.graphicshardware.org/previous/www_2007/presentations/doggett-radeon2900-gh07.pdf
[35]: Mantor M., AMD's Radeon HD 2900, Hot Chips 19, Aug. 2007, http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf
[36]: Houston M., Anatomy of AMD's TeraScale Graphics Engine, SIGGRAPH 2008, http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf
[37]: Mantor M., Entering the Golden Age of Heterogeneous Computing, PEEP 2008,http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf
[38]: Kogo H., RV770 Overview, PC Watch, July 02 2008,http://pc.watch.impress.co.jp/docs/2008/0702/kaigai_09.pdf
[39]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies Sept 30 2009,http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT093009110932&mode=print
[40]: Wasson S., Nvidia's 'Fermi' GPU architecture revealed, Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1
[41]: Wasson S., AMD's Radeon HD 5870 graphics processor,
Tech Report, Sept 23 2009, http://techreport.com/articles.x/17618/1
[42]: Bell B., ATI Radeon HD 5870 Performance Preview, FiringSquad, Sept 22 2009, http://www.firingsquad.com/hardware/ati_radeon_hd_5870_performance_preview/default.asp
5. Microarchitecture and operation
5.1 Nvidia's GPGPU line
5.2 AMD/ATI's GPGPU line
5.3 Intel's Larrabee
5.1 Nvidia's GPGPU line
Microarchitecture of GPUs
Microarchitecture of GPGPUs

3-level microarchitectures: microarchitectures inheriting the structure of programmable GPUs.
E.g. Nvidia's and AMD/ATI's GPGPUs

Two-level microarchitectures: dedicated microarchitectures developed a priori to support both graphics and HPC.
E.g. Intel's Larrabee

Figure: Alternative layouts of microarchitectures of GPGPUs

Microarchitecture of GPUs
Simplified block diagram of recent 3-level GPUs/data parallel accelerators
(Data parallel accelerators do not include display controllers.)
[Figure: Host CPU and host memory attach via the North Bridge and a PCI-E x16 interface; a Command Processor Unit and Work Scheduler distribute commands and data to the Core Block Array (CBA) of core blocks CB1…CBm (cores with L1 caches); an interconnection network (IN) links the core blocks to L2 caches and memory controllers (2x32-bit channels) of the global memory; a Hub and a Display controller are also attached]
CB: Core Block
CBA: Core Block Array
IN: Interconnection Network
MC: Memory Controller
Table: Terminologies used with GPGPUs/data parallel accelerators

In these slides | Nvidia | AMD/ATI
C: Core, SIMT core | SM: Streaming Multiprocessor, multithreaded processor | shader processor, thread processor
CB: Core Block | TPC: Texture Processor Cluster | multiprocessor, SIMD array, SIMD engine, SIMD core, SIMD
CBA: Core Block Array | SPA: Streaming Processor Array | -
ALU: Arithmetic Logic Unit | streaming processor, thread processor, scalar ALU | stream processing unit, stream processor

Microarchitecture of Nvidia's GPGPUs
GPGPUs based on 3-level microarchitectures
Figure: Overview of Nvidia's and AMD/ATI's GPGPU lines
Nvidia's line: 90 nm: G80; 65 nm: G92 (shrink), G200 (enhanced arch.)
AMD/ATI's line: 80 nm: R600; 55 nm: RV670 (shrink), RV770 (enhanced arch.)
G80/G92 microarchitecture
Figure: Overview of the G80 [14]
Figure: Overview of the G92 [15]
Figure: The Core Block of the G80/G92 [14], [15]
Individual components of the core
Figure: Block diagram of G80/G92 cores [14], [15]
Streaming Processors: SIMT ALUs
SM Register File (RF)
8K registers (each 4 bytes wide) deliver 4 operands/clock.
The load/store pipe can also read/write the RF.
Figure: Register File [12] (I$ L1, multithreaded instruction buffer, RF, C$ L1, shared memory, operand select, MAD, SFU)
Programmer's view of the Register File
There are 8192 and 16384 registers in each SM in the G80 and the GT200, respectively.
This is an implementation decision, not part of CUDA.
Registers are dynamically partitioned across all thread blocks assigned to the SM (e.g. 4 thread blocks vs. 3 thread blocks in the figure).
Once assigned to a thread block, a register is NOT accessible by threads in other blocks.
Each thread in the same block only accesses registers assigned to itself.
Figure: The programmer's view of the Register File [12]
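The two cases shown in the figure follow directly from the 8192-register budget; e.g. with 256-thread blocks (the per-thread register counts are illustrative):

$\lfloor 8192 / (256 \times 8) \rfloor = 4\ \text{blocks at 8 registers/thread}, \qquad \lfloor 8192 / (256 \times 10) \rfloor = 3\ \text{blocks at 10 registers/thread}$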
The Constant Cache
Immediate address constants
Indexed address constants
Constants are stored in DRAM and cached on chip (L1 per SM).
A constant value can be broadcast to all threads in a warp:
an extremely efficient way of accessing a value that is common for all threads in a block!
Figure: The constant cache [12] (I$ L1, multithreaded instruction buffer, RF, C$ L1, shared memory, operand select, MAD, SFU)
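A hedged CUDA sketch of such a broadcast constant (the symbol name coeff is illustrative): all threads of a warp read the same cached address, so the value is broadcast.

__constant__ float coeff;                        // stored in DRAM, cached on chip

__global__ void applyCoeff(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] *= coeff;                           // same address for all threads: broadcast
}

// Host side, before the launch:
// cudaMemcpyToSymbol(coeff, &hostValue, sizeof(float));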
Shared Memory
Each SM has 16 KB of shared memory, organized as 16 banks of 32-bit words.
CUDA uses shared memory as shared storage visible to all threads in a thread block (read and write access).
Not used explicitly for pixel shader programs.
Figure: Shared Memory [12] (I$ L1, multithreaded instruction buffer, RF, C$ L1, shared memory, operand select, MAD, SFU)
A program needs to manage the global, constant and texture memory spaces visible to kernels through calls to the CUDA runtime.
This includes memory allocation and deallocation as well as invoking data transfers between the CPU and the GPU, as sketched below.
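A minimal host-side sketch with the CUDA runtime (the buffer name and size N are illustrative):

#include <cuda_runtime.h>
#include <stdlib.h>

void example(void)
{
    const int N = 1024;
    float *hA = (float *)malloc(N * sizeof(float));               // host buffer
    float *dA;
    cudaMalloc((void **)&dA, N * sizeof(float));                  // allocate device global memory
    cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice); // CPU -> GPU transfer
    /* ... kernel launches operating on dA ... */
    cudaMemcpy(hA, dA, N * sizeof(float), cudaMemcpyDeviceToHost); // GPU -> CPU transfer
    cudaFree(dA);                                                 // deallocate device memory
    free(hA);
}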
Figure: Major functional blocks of G80/G92 ALUs [14], [15]
Barrier synchronization
Synchronization is achieved by calling the void __syncthreads() intrinsic function [11];
it is used to coordinate memory accesses at synchronization points;
at synchronization points the execution of the threads is suspended until all threads reach this point (barrier synchronization).
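A minimal sketch of the barrier in use (the per-block reversal is illustrative and assumes 256-thread blocks): __syncthreads() guarantees that every thread's store to shared memory has completed before any thread reads an element written by another thread.

__global__ void reverseBlock(float *d)
{
    __shared__ float s[256];           // per block shared memory
    int t = threadIdx.x;
    s[t] = d[t];                       // each thread stores one element
    __syncthreads();                   // barrier: all stores complete before any load below
    d[t] = s[blockDim.x - 1 - t];      // now safe to read another thread's element
}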
Principle of operation
Based on Nvidia's data parallel computing model.
Nvidia's data parallel computing model is specified at different levels of abstraction:
at the Instruction Set Architecture (ISA) level (not disclosed),
at the intermediate level (at the level of APIs; not discussed here),
at the high level programming language level, by means of CUDA.

CUDA [11]
is a programming language and programming environment that allows explicit data parallel execution on an attached massively parallel device (GPGPU);
its underlying principle is to allow the programmer to target portions of the source code for execution on the GPGPU;
it is defined as a set of C-language extensions.
The key element of the language is the notion of the kernel.
A kernel is specified by
using the __global__ declaration specifier,
a number of associated CUDA threads,
a domain of execution (grid, blocks) using the <<< >>> execution configuration syntax.
Execution of kernels
When called, a kernel is executed N times in parallel by N associated CUDA threads, as opposed to only once like regular C functions.
Example
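The sample code itself did not survive in the transcript; the canonical kernel from the CUDA Programming Guide [11], which the following text paraphrases, is essentially:

// Kernel definition
__global__ void VecAdd(float *A, float *B, float *C)
{
    int i = threadIdx.x;     // one-dimensional thread index
    C[i] = A[i] + B[i];
}

// Invocation from host code: one block of N threads
// VecAdd<<<1, N>>>(A, B, C);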
The above sample code adds two vectors A and B of size N and stores the result into vector C,
by executing the invoked threads (identified by a one-dimensional index i) in parallel on the attached massively parallel GPGPU,
rather than adding the vectors A and B by executing embedded loops on the conventional CPU.
Remark
The thread index threadIdx is a vector of up to 3 components, which identifies a one-, two- or three-dimensional thread block.
The kernel concept is enhanced by three key abstractions:
the thread concept,
the memory concept and
the synchronization concept.
The thread concept
is based on a three-level hierarchy of threads:
grids
thread blocks
threads
The hierarchy of threads
Each kernel invocation is executed as a grid of thread blocks (Block(i,j)).
kernel0()
kernel1()
Host / Device
Figure: Hierarchy of threads [25]
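In CUDA source code the grid/block hierarchy is expressed in the execution configuration; a minimal sketch (the dimensions are illustrative, kernel0 stands for a kernel as in the figure):

__global__ void kernel0(void) { }     // placeholder kernel, as in the figure

int main()
{
    dim3 grid(4, 2);      // the grid: 4 x 2 thread blocks per kernel invocation
    dim3 block(16, 16);   // each thread block: 16 x 16 threads
    kernel0<<<grid, block>>>();
    return 0;
}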
The memory concept
Threads have
private registers (R/W access),
per block shared memory (R/W access),
per grid global memory (R/W access),
per block constant memory (R access),
per TPC texture memory (R access).
The global, constant and texture memory spaces can be read from or written to by the CPU and are persistent across kernel launches by the same application.
Shared memory is organized into banks (16 banks in version 1).
Figure: Memory concept [26] (revised)
Mapping of the memory spaces of the programming model to the memory spaces of the streaming processor
A thread block is scheduled for execution to a particular multithreaded SM (e.g. Streaming Multiprocessor 1 (SM 1) in the figure).
An SM incorporates 8 execution units (designated as processors in the figure).
SMs are the fundamental processing units for CUDA thread blocks.
Figure: Memory spaces of the SM [7]
The synchronization concept
Barrier synchronization
Synchronization is achieved by calling the void __syncthreads() intrinsic function;
it is used to coordinate memory accesses at synchronization points;
at synchronization points the execution of the threads is suspended until all threads reach this point (barrier synchronization).
GT200
Figure: Block diagram of the GT200 [16]
Figure: The Core Block of the GT200 [16]
Figure: Block diagram of the GT200 cores [16]
Streaming Multiprocessors: SIMT cores
Figure: Major functional blocks of GT200 ALUs [16]
Figure: Die shot of the GT200 [17]
RV770-RV870 comparison

ATI Radeon HD 4870 | ATI Radeon HD 5870 | Difference
Die Size 263 mm2 | 334 mm2 | 1.27x
# of Transistors 956 million | 2.15 billion | 2.25x
# of Shaders 800 | 1600 | 2x
Board Power 90 W idle, 160 W load | 27 W idle, 188 W max | 0.3x, 1.17x