High performance computing on the GPU: NVIDIA G80 and CUDA Won-Ki Jeong, Ross Whitaker SCI Institute...

High performance computing on the GPU: NVIDIA G80 and CUDA

Won-Ki Jeong, Ross Whitaker

SCI Institute

University of Utah

GPGPU
• General Purpose computation on the GPU
  – Started in the computer graphics community
  – Mapping computation problems to the graphics rendering pipeline

Courtesy Jens Krueger and Aaron Lefohn

Why GPU for Computing?
• GPU is fast
  – Massively parallel
    • CPU : ~4 cores @ 3.0 GHz (Intel Quad Core)
    • GPU : ~128 processors @ 1.35 GHz (NVIDIA GeForce 8800 GTX)
  – High memory bandwidth
    • CPU : 21 GB/s
    • GPU : 86 GB/s
  – Simple architecture optimized for compute-intensive tasks
• Programmable
  – Shaders, NVIDIA CUDA, ATI CTM
• High-precision floating point support
  – 32-bit floating point (IEEE 754)
  – 64-bit floating point will be available in early 2008

Why GPU for Computing?
• Inexpensive supercomputer
  – Two NVIDIA Tesla D870 : 1 TFLOPS
• GPU hardware performance increases faster than CPU performance
  – Trend : simple, scalable architecture, interaction of clock speed, cache, memory (bandwidth)

[Figure: GFLOPS chart. Legend: NV30 = GeForce FX 5800, NV35 = GeForce FX 5950 Ultra, NV40 = GeForce 6800 Ultra, G80 = GeForce 8800 GTX, G80GL = Quadro FX 5600]

Courtesy NVIDIA

GPU is for Parallel Computing
• CPU
  – Large cache and sophisticated flow control minimize latency for arbitrary memory access in serial processes
• GPU
  – Simple flow control and limited cache, more transistors for computing in parallel
  – High arithmetic intensity hides memory latency

[Figure: CPU and GPU block diagrams showing Control, Cache, ALUs, and DRAM]

Courtesy NVIDIA

GPU-friendly Problems
• High arithmetic intensity
  – Computation must offset memory latency
• Coherent data access (e.g. structured grids)
  – Maximize memory bandwidth
• Data-parallel processing
  – Same computation over large datasets (SIMD)
    • E.g. convolution using a fixed kernel, PDEs
    • Jacobi updates (isolate data stream read and write)

Traditional GPGPU Model
• GPU as a streaming processor (SIMD)
  – Memory
    • Textures
  – Computation kernel
    • Vertex / fragment shaders
  – Programming
    • Graphics API (OpenGL, DirectX), Cg, HLSL
• Example
  – Render a screen-sized quad with texture mapping using a fragment shader

Graphics Pipeline

[Figure: pipeline stages Vertex Processor, Rasterizer, Fragment Processor, and Framebuffer, with Texture memory]

Problems of Traditional GPGPU Model
• Software limitations
  – Steep learning curve
  – Graphics API overhead
  – Inconsistency in API
  – Debugging is difficult
• Hardware limitations
  – No general memory access (no scatter)
    • B = A[i] : gather (supported)
    • A[i] = B : scatter (not supported)
  – No integer/bitwise operations
  – Memory access can be a bottleneck
    • Need coherent memory access for cache performance
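The gather/scatter distinction can be sketched in CUDA, the model the following slides introduce, where both patterns are expressible. This is only an illustration: the kernel names and the host helper `scatterThenGather` are hypothetical, error checking is omitted, and the scatter assumes the index array is a permutation (duplicate indices would race).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Gather: read from a computed address, write to the thread's own slot.
__global__ void gatherKernel(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

// Scatter: read the thread's own slot, write to a computed address.
// Assumes idx contains no duplicates; otherwise the writes race.
__global__ void scatterKernel(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[idx[i]] = in[i];
}

// Hypothetical host helper: scatters by idx, then gathers by the same idx.
// For a permutation this round trip reproduces the input.
std::vector<float> scatterThenGather(const std::vector<float>& h_in,
                                     const std::vector<int>& h_idx)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    float *d_in, *d_tmp, *d_out;
    int* d_idx;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_tmp, n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMalloc((void**)&d_idx, n * sizeof(int));
    cudaMemcpy(d_in,  h_in.data(),  n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, h_idx.data(), n * sizeof(int),   cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scatterKernel<<<blocks, threads>>>(d_in, d_idx, d_tmp, n);
    gatherKernel<<<blocks, threads>>>(d_tmp, d_idx, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_tmp); cudaFree(d_out); cudaFree(d_idx);
    return h_out;
}
```

In a fragment shader the output address is fixed by the rasterizer, so only the gather form was available.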

NVIDIA G80 and CUDA
• New HW/SW architecture for computing on the GPU
  – GPU as a massively parallel multithreaded machine : one step beyond the streaming model
  – New hardware features
    • Unified shaders (ALUs)
    • Flexible memory access
    • Fast user-controllable on-chip memory
    • Integer, bitwise operations
  – New software features
    • Extended C programming language and compiler
    • Debugging support (through emulation)

GPU : Highly Parallel Coprocessor
• GPU as a coprocessor that
  – Has its own DRAM memory
  – Communicates with the host (CPU) through a bus (PCIe)
  – Runs many threads in parallel
• GPU threads
  – GPU threads are extremely lightweight (almost no cost for creation/context switch)
  – GPU needs at least several thousand threads for full efficiency

Programming Model: SPMD + SIMD
• Hierarchy
  – Device = Grids
  – Grid = Blocks
  – Block = Warps
  – Warp = Threads
• Single kernel runs on multiple blocks (SPMD)
• Single instruction executed on multiple threads (SIMD)
  – Warp size determines SIMD granularity (G80 : 32 threads)
• Synchronization within a block using shared memory

[Figure: the host launches Kernel 1 and Kernel 2 on the device as Grid 1 and Grid 2. Grid 1 contains Block(0, 0) through Block(2, 1); Block(1, 1) contains Thread(0, 0) through Thread(4, 2)]

Courtesy NVIDIA
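A minimal sketch of how a thread locates itself in this hierarchy, using the figure's layout of 3 x 2 blocks with 5 x 3 threads each. The kernel `writeBlockId` and the helper `blockIdMap` are hypothetical names; `threadIdx`, `blockIdx`, `blockDim`, and `gridDim` are the standard CUDA built-in variables. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread derives its global (x, y) position from its block and thread
// indices, then records the linear ID of the block it belongs to.
__global__ void writeBlockId(int* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = blockIdx.y * gridDim.x + blockIdx.x;
}

// Hypothetical host helper: launches the figure's layout and returns the map.
std::vector<int> blockIdMap()
{
    dim3 block(5, 3);   // Thread(0, 0) .. Thread(4, 2)
    dim3 grid(3, 2);    // Block(0, 0) .. Block(2, 1)
    int width  = grid.x * block.x;   // 15 threads across
    int height = grid.y * block.y;   // 6 threads down
    std::vector<int> h_out(width * height);

    int* d_out;
    cudaMalloc((void**)&d_out, h_out.size() * sizeof(int));
    writeBlockId<<<grid, block>>>(d_out, width, height);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

For example, the thread at x = 7, y = 4 falls in Block(1, 1), so its entry holds 1 * 3 + 1 = 4.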

Hardware Implementation : a set of SIMD Processors
• Device
  – a set of multiprocessors
• Multiprocessor
  – a set of 32-bit SIMD processors

[Figure: a device contains Multiprocessors 1 to N; each multiprocessor has one instruction unit and Processors 1 to M]

Courtesy NVIDIA

Memory Model
• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can read/write global, constant, and texture memory

[Figure: a grid with Block (0, 0) and Block (1, 0). Each block has its own shared memory; each thread has its own registers and local memory. Global, constant, and texture memory belong to the grid and are accessible from the host]

Courtesy NVIDIA

Hardware Implementation : Memory Architecture
• Device memory (DRAM)
  – Slow (200 to 300 cycles)
  – Local, global, constant, and texture memory
• On-chip memory
  – Fast (1 cycle)
  – Registers, shared memory, constant/texture cache

[Figure: each multiprocessor holds per-processor registers, shared memory, an instruction unit, a constant cache, and a texture cache; all multiprocessors connect to device memory]

Courtesy NVIDIA

Memory Access Strategy

Copy data from global to shared memory

Synchronization

Computation (iteration)

Synchronization

Copy data from shared to global memory
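The five steps above can be sketched as a single kernel. The computation chosen here, a 3-point average within each tile with clamped edges and no exchange between tiles, is only a stand-in, not the authors' method. `TILE`, `smoothTiles`, and `smoothOnDevice` are hypothetical names, the array length is assumed to be a non-zero multiple of `TILE`, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 256   // threads per block = elements per tile (assumed)

__global__ void smoothTiles(float* data, int iters)
{
    __shared__ float cur[TILE];
    int t = threadIdx.x;
    int g = blockIdx.x * TILE + t;

    cur[t] = data[g];                 // 1. copy global -> shared
    __syncthreads();                  // 2. synchronization

    for (int k = 0; k < iters; ++k) { // 3. computation (iteration)
        int l = (t > 0) ? t - 1 : t;            // clamp at tile edges
        int r = (t < TILE - 1) ? t + 1 : t;
        float v = (cur[l] + cur[t] + cur[r]) / 3.0f;
        __syncthreads();              // all reads finish before any overwrite
        cur[t] = v;
        __syncthreads();              // 4. synchronization: writes are visible
    }

    data[g] = cur[t];                 // 5. copy shared -> global
}

// Hypothetical host helper; h_data.size() must be a non-zero multiple of TILE.
std::vector<float> smoothOnDevice(std::vector<float> h_data, int iters)
{
    size_t bytes = h_data.size() * sizeof(float);
    int blocks = static_cast<int>(h_data.size() / TILE);

    float* d_data;
    cudaMalloc((void**)&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);
    smoothTiles<<<blocks, TILE>>>(d_data, iters);
    cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return h_data;
}
```

Global memory is touched only twice per element, once on the way in and once on the way out; every iteration in between runs out of on-chip shared memory.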

Execution Model
• Each thread block is executed by a single multiprocessor
  – Synchronized using shared memory
• Many thread blocks are assigned to a single multiprocessor
  – Executed concurrently in a time-sharing fashion
  – Keep GPU as busy as possible
• Running many threads in parallel can hide DRAM memory latency
  – Global memory access : 200 to 300 cycles

CUDA
• C-extension programming language
  – No graphics API
    • Flattens learning curve
    • Better performance
  – Supports debugging tools
• Extensions / API
  – Function type : __global__, __device__, __host__
  – Variable type : __shared__, __constant__
  – cudaMalloc(), cudaFree(), cudaMemcpy(),…
  – __syncthreads(), atomicAdd(),…
• Program types
  – Device program (kernel) : runs on the GPU
  – Host program : runs on the CPU to call device programs

Example: Vector Addition Kernel

// Pair-wise addition of vector elements
// One thread per addition
__global__ void
vectorAdd(float* iA, float* iB, float* oC)
{
    // Global index of this thread across all blocks
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    oC[idx] = iA[idx] + iB[idx];
}

Courtesy NVIDIA

Example: Vector Addition Host Code

float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
// … initialize h_A and h_B

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float) );
cudaMalloc( (void**) &d_B, N * sizeof(float) );
cudaMalloc( (void**) &d_C, N * sizeof(float) );

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
// (assumes N is a multiple of 256)
vectorAdd<<< N/256, 256 >>>( d_A, d_B, d_C );

Courtesy NVIDIA
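The host code above stops at the kernel launch. A hedged sketch of the full round trip follows, with the result copied back, device memory freed, and a bounds check so that N need not be a multiple of 256. `vectorAddChecked` and `addOnDevice` are hypothetical names, not part of the NVIDIA example; both input vectors are assumed to have the same length, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Same pair-wise addition, with a guard for the partially filled last block.
__global__ void vectorAddChecked(const float* iA, const float* iB,
                                 float* oC, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n)
        oC[idx] = iA[idx] + iB[idx];
}

// Hypothetical host helper: returns h_A + h_B computed on the device.
std::vector<float> addOnDevice(const std::vector<float>& h_A,
                               const std::vector<float>& h_B)
{
    const int n = static_cast<int>(h_A.size());
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_C(n);
    if (n == 0) return h_C;

    // allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);

    // copy host memory to device
    cudaMemcpy(d_A, h_A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B.data(), bytes, cudaMemcpyHostToDevice);

    // round the block count up so every element gets a thread
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vectorAddChecked<<<blocks, threads>>>(d_A, d_B, d_C, n);

    // copy the result back; this call waits for the kernel to finish
    cudaMemcpy(h_C.data(), d_C, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return h_C;
}
```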

Compiling CUDA
• nvcc
  – Compiler driver
  – Invokes cudacc, g++, cl
• PTX
  – Parallel Thread eXecution

[Figure: NVCC compiles a C/C++ CUDA application into CPU code and PTX code; a PTX-to-target compiler then generates target code for the G80 or another GPU]

Example PTX code:
    ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32 $f1, $f5, $f3, $f1;

Courtesy NVIDIA

Debugging
• Emulation mode
  – CUDA code can be compiled and run in emulation mode (nvcc -deviceemu)
  – No device or driver needed
  – Each device thread is emulated with a host thread
  – Can call host functions from device code (e.g., printf)
  – Supports host debug functions (breakpoint, inspection, etc.)
• Hardware debug will be available late 2007

Optimization Tips
• Avoid shared memory bank conflicts
  – Shared memory space is split into 16 banks
    • Each bank is 4 bytes (32 bits) wide
    • Assigned in round-robin fashion
  – Any non-overlapping parallel bank access can be done by a single memory operation
• Coalesced global memory access
  – Access to contiguous memory addresses is fast
    • a = b[thread_id]; // coalesced
    • a = b[2*thread_id]; // non-coalesced
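Both tips can be illustrated with a tiled matrix transpose, a commonly used pattern that is not part of this deck. The sketch assumes the 16 banks of 4 bytes stated above; `TILE`, `transposeTiled`, and `transposeOnDevice` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16   // matches the 16 shared memory banks assumed above

__global__ void transposeTiled(const float* in, float* out,
                               int width, int height)
{
    // The extra column shifts each row by one bank, so reading a column
    // touches 16 different banks instead of the same bank 16 times.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read
    __syncthreads();

    // Swap the block offsets so consecutive threads still write
    // consecutive addresses in the output (which has `height` columns).
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}

// Hypothetical host helper: h_in is row-major with `height` rows and
// `width` columns; the result has `width` rows and `height` columns.
std::vector<float> transposeOnDevice(const std::vector<float>& h_in,
                                     int width, int height)
{
    size_t bytes = h_in.size() * sizeof(float);
    std::vector<float> h_out(h_in.size());
    if (h_in.empty()) return h_out;

    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(d_in, d_out, width, height);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Without the shared memory tile, either the read or the write would have to stride through global memory, which is the non-coalesced case from the slide.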

CUDA Enabled GPUs / OS
• Supported OS
  – MS Windows, Linux
• Supported HW
  – NVIDIA GeForce 8800 series
  – NVIDIA Quadro FX 5600/4600
  – NVIDIA Tesla series

Courtesy NVIDIA

ATI CTM (Close To Metal)
• Similar to CUDA PTX
  – A set of native device instructions
• No compiler support
• Limited programming environment

Example: Fast Iterative Method
• CUDA implementation
  – Tile size : 4x4x4
  – Update active tile
    • Neighbor access
  – Manage active list
    • Parallel reduction

Coalesced Global Memory Access
• Reordering
  – Each tile is stored in global memory in contiguous memory space

[Figure: non-coalesced vs. coalesced tile layout]

Update Active Tile
• Compute new solution
  – Copy a tile and its neighbor pixels to shared memory
  – Avoid bank conflicts

[Figure: tile (yellow) with its left, right, top, and bottom neighbors]

Manage Active List
• Active list
  – Simple 1D integer array of active tile indices
  – Need to know which tile is NOT converged

[Figure: 4x4 grid of tiles numbered 0 to 15, showing active points and active tiles; active list = {1, 2, 5, 7, 8, 9, 11, 13, 14}]

Parallel Reduction of Convergence in Tiles

[Figure: tree reduction over a tile (1D view) of convergence flags; each level halves the number of flags (8, 4, 2, 1) until a single flag remains. T = converged, F = not converged]
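A minimal sketch of such a reduction, assuming one thread block per 4x4x4 tile (64 points) and a logical AND as the combining operation, so a tile counts as converged only if every point in it has converged. The names `TILE_SIZE`, `reduceConvergence`, and `tilesConverged` are hypothetical, the flag array length is assumed to be a non-zero multiple of 64, and this is not necessarily how the authors' implementation is written.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_SIZE 64   // 4x4x4 points per tile (assumed one block per tile)

// Reduces the per-point flags of each tile to one per-tile flag.
__global__ void reduceConvergence(const int* pointConverged, int* tileConverged)
{
    __shared__ int flags[TILE_SIZE];
    int t = threadIdx.x;

    flags[t] = pointConverged[blockIdx.x * TILE_SIZE + t];
    __syncthreads();

    // Halve the number of active threads at every level of the tree.
    for (int stride = TILE_SIZE / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            flags[t] = flags[t] && flags[t + stride];
        __syncthreads();
    }

    if (t == 0)
        tileConverged[blockIdx.x] = flags[0];
}

// Hypothetical host helper; h_points.size() must be a non-zero multiple
// of TILE_SIZE. Returns one flag per tile (1 = converged).
std::vector<int> tilesConverged(const std::vector<int>& h_points)
{
    int tiles = static_cast<int>(h_points.size() / TILE_SIZE);
    std::vector<int> h_tiles(tiles);

    int *d_points, *d_tiles;
    cudaMalloc((void**)&d_points, h_points.size() * sizeof(int));
    cudaMalloc((void**)&d_tiles, tiles * sizeof(int));
    cudaMemcpy(d_points, h_points.data(), h_points.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    reduceConvergence<<<tiles, TILE_SIZE>>>(d_points, d_tiles);

    cudaMemcpy(h_tiles.data(), d_tiles, tiles * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_points);
    cudaFree(d_tiles);
    return h_tiles;
}
```

A tile whose flag comes back 0 stays on the active list; 64 flags are combined in 6 steps instead of 63 sequential ones.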

Wrap up
• GPU computing is promising
  – Many scientific computing problems are parallelizable
  – More consistency/stability in HW/SW
    • Streaming architectures are here to stay (and more so)
    • Industry trend is multi/many core processor
  – Better support/tools (easier to learn, maintain)
• Issues
  – No industry-wide standard
  – Market driven by gaming industry
  – Not every problem is suitable for GPUs
  – Re-engineer algorithms/software
  – Future performance growth????
• Impact on the data-analysis/interpretation workflow

Questions?
