High performance computing on the GPU: NVIDIA G80 and CUDA
Won-Ki Jeong, Ross Whitaker
SCI Institute
University of Utah
GPGPU
• General Purpose computation on the GPU
  – Started in the computer graphics community
  – Mapping computation problems to the graphics rendering pipeline
Courtesy Jens Krueger and Aaron Lefohn
Why GPU for Computing?
• GPU is fast
  – Massively parallel
    • CPU : ~4 cores @ 3.0 GHz (Intel Quad Core)
    • GPU : ~128 stream processors @ 1.35 GHz (NVIDIA GeForce 8800 GTX)
  – High memory bandwidth
    • CPU : 21 GB/s
    • GPU : 86 GB/s
  – Simple architecture optimized for compute-intensive tasks
• Programmable
  – Shaders, NVIDIA CUDA, ATI CTM
• High-precision floating point support
  – 32-bit floating point (IEEE 754)
  – 64-bit floating point will be available in early 2008
Why GPU for computing?
• Inexpensive supercomputer
  – Two NVIDIA Tesla D870 : 1 TFLOPS
• GPU hardware performance increases faster than CPU
  – Trend : simple, scalable architecture, interaction of clock speed, cache, memory (bandwidth)
[Figure: GFLOPS by GPU generation. Legend:]
G80GL = Quadro FX 5600
G80 = GeForce 8800 GTX
G71 = GeForce 7900 GTX
G70 = GeForce 7800 GTX
NV40 = GeForce 6800 Ultra
NV35 = GeForce FX 5950 Ultra
NV30 = GeForce FX 5800
Courtesy NVIDIA
GPU is for Parallel Computing
• CPU
  – Large cache and sophisticated flow control minimize latency for arbitrary memory access in a serial process
• GPU
  – Simple flow control and limited cache, more transistors for computing in parallel
  – High arithmetic intensity hides memory latency
[Figure: CPU vs. GPU chip layout (Control, Cache, ALU, DRAM)]
Courtesy NVIDIA
GPU-friendly Problems
• High arithmetic intensity
  – Computation must offset memory latency
• Coherent data access (e.g. structured grids)
  – Maximize memory bandwidth
• Data-parallel processing
  – Same computation over large datasets (SIMD)
    • E.g. convolution using a fixed kernel, PDEs
    • Jacobi updates (isolate data stream read and write)
Traditional GPGPU Model
• GPU as a streaming processor (SIMD)
  – Memory
    • Textures
  – Computation kernel
    • Vertex / fragment shaders
  – Programming
    • Graphics API (OpenGL, DirectX), Cg, HLSL
• Example
  – Render a screen-sized quad with texture mapping using a fragment shader
[Figure: Graphics Pipeline (Vertex Processor, Rasterizer, Fragment Processor, Framebuffer, Texture)]
Problems of Traditional GPGPU Model
• Software limitation
  – Steep learning curve
  – Graphics API overhead
  – Inconsistency in API
  – Debugging is difficult
• Hardware limitation
  – No general memory access (no scatter)
    • B = A[i] : gather (supported)
    • A[i] = B : scatter (not supported)
  – No integer/bitwise operations
  – Memory access can be a bottleneck
    • Need coherent memory access for cache performance
NVIDIA G80 and CUDA
• New HW/SW architecture for computing on the GPU
  – GPU as a massively parallel multithreaded machine : one step beyond the streaming model
  – New hardware features
    • Unified shaders (ALUs)
    • Flexible memory access
    • Fast user-controllable on-chip memory
    • Integer, bitwise operations
  – New software features
    • Extended C programming language and compiler
    • Debugging support (through emulation)
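To make "flexible memory access" and "integer, bitwise operations" concrete, here is a minimal CUDA C sketch of a gather kernel and a scatter kernel. It is an illustration, not code from the talk: the kernel names, the helper runGatherScatter(), and the XOR-based index reversal (which assumes a power-of-two length) are all hypothetical choices made for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Gather: each thread READS from a computed address and writes its own slot.
// XOR with (n - 1) reverses the index when n is a power of two (assumption).
__global__ void gatherReverse(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i ^ (n - 1)];
}

// Scatter: each thread reads its own slot and WRITES to a computed address.
__global__ void scatterReverse(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i ^ (n - 1)] = in[i];
}

// Hypothetical helper: runs both kernels and returns true if each one
// produced the reversed input.
bool runGatherScatter()
{
    const int n = 16;                          // must be a power of two
    const size_t bytes = n * sizeof(float);
    float h_in[n], h_gather[n], h_scatter[n];
    for (int i = 0; i < n; ++i) {
        h_in[i] = (float)i;
        h_gather[i] = h_scatter[i] = -1.0f;    // sentinel in case a copy fails
    }

    float *d_in, *d_gather, *d_scatter;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_gather, bytes);
    cudaMalloc((void**)&d_scatter, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    gatherReverse<<<1, n>>>(d_in, d_gather, n);
    scatterReverse<<<1, n>>>(d_in, d_scatter, n);

    // cudaMemcpy waits for the preceding kernels to finish.
    cudaMemcpy(h_gather, d_gather, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_scatter, d_scatter, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_gather);
    cudaFree(d_scatter);

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        float expected = (float)(n - 1 - i);
        if (h_gather[i] != expected || h_scatter[i] != expected) ok = false;
    }
    return ok;
}

int main()
{
    printf("gather/scatter %s\n", runGatherScatter() ? "ok" : "FAILED");
    return 0;
}
```

Both kernels produce the same reversed array; the only difference is whether the computed index appears on the read side (gather) or the write side (scatter).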
GPU : Highly Parallel Coprocessor
• GPU as a coprocessor that
  – Has its own DRAM memory
  – Communicates with the host (CPU) through a bus (PCI Express)
  – Runs many threads in parallel
• GPU threads
  – GPU threads are extremely lightweight (almost no cost for creation/context switch)
  – GPU needs at least several thousand threads for full efficiency
Programming Model: SPMD + SIMD
• Hierarchy
  – Device = Grids
  – Grid = Blocks
  – Block = Warps
  – Warp = Threads
• Single kernel runs on multiple blocks (SPMD)
• Single instruction executed on multiple threads (SIMD)
  – Warp size determines SIMD granularity (G80 : 32 threads)
• Synchronization within a block using shared memory
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 of the device; Grid 1 contains Block(0, 0) to Block(2, 1); Block(1, 1) contains Thread(0, 0) to Thread(4, 2)]
Courtesy NVIDIA
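The grid/block/thread hierarchy is exposed to a kernel through the built-in variables gridDim, blockIdx, blockDim, and threadIdx. The sketch below uses the same shape as the figure (a 3 x 2 grid of blocks with 5 x 3 threads each) and has every thread record which block it belongs to. The kernel name recordBlockId and the helper countMismatches() are hypothetical names for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread derives its global 2D position from the built-in variables
// and records the linear ID of the block it belongs to.
__global__ void recordBlockId(int* out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // global column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // global row
    out[y * width + x] = blockIdx.y * gridDim.x + blockIdx.x;
}

// Hypothetical helper: launches the kernel and counts how many entries
// differ from the block ID computed on the host.
int countMismatches()
{
    const int width = 15;    // 3 blocks * 5 threads in x
    const int height = 6;    // 2 blocks * 3 threads in y
    const size_t bytes = width * height * sizeof(int);

    int h_out[width * height];
    for (int i = 0; i < width * height; ++i) h_out[i] = -1;   // sentinel

    int* d_out;
    cudaMalloc((void**)&d_out, bytes);

    dim3 grid(3, 2);     // blocks per grid, as in the figure
    dim3 block(5, 3);    // threads per block, as in the figure
    recordBlockId<<<grid, block>>>(d_out, width);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    int mismatches = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (h_out[y * width + x] != (y / 3) * 3 + (x / 5)) ++mismatches;
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", countMismatches());
    return 0;
}
```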
Hardware Implementation : a set of SIMD Processors
• Device – a set of multiprocessors
• Multiprocessor – a set of 32-bit SIMD processors
[Figure: Device containing Multiprocessor 1 to Multiprocessor N; each multiprocessor has an Instruction Unit and Processor 1 to Processor M]
Courtesy NVIDIA
Memory Model
• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can read/write global, constant, and texture memory
[Figure: a Grid with per-grid Global, Constant, and Texture Memory accessible from the Host; Block (0, 0) and Block (1, 0) each have Shared Memory; Thread (0, 0) and Thread (1, 0) in each block have their own Registers and Local Memory]
Courtesy NVIDIA
Hardware Implementation : Memory Architecture
• Device memory (DRAM)
  – Slow (200~300 cycles)
  – Local, global, constant, and texture memory
• On-chip memory
  – Fast (1 cycle)
  – Registers, shared memory, constant/texture cache
[Figure: Device containing Multiprocessor 1 to Multiprocessor N and Device memory; each multiprocessor has Shared Memory, an Instruction Unit, Processor 1 to Processor M with their own Registers, a Constant Cache, and a Texture Cache]
Courtesy NVIDIA
Memory Access Strategy
Copy data from global to shared memory
Synchronization
Computation (iteration)
Synchronization
Copy data from shared to global memory
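The five steps above can be sketched as a CUDA C kernel. This is an illustrative example under stated assumptions, not the authors' code: the computation is a simple 3-point averaging that stays inside each tile (no halo exchange between tiles), and TILE, ITERS, smoothTiles, smoothTilesCpu, and maxError are hypothetical names. Two shared buffers are used so that each Jacobi-style iteration reads only values from the previous iteration.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#define TILE  64   // threads (and elements) per block
#define ITERS 10   // smoothing iterations per tile

__global__ void smoothTiles(float* data)
{
    __shared__ float cur[TILE];
    __shared__ float nxt[TILE];
    int t = threadIdx.x;
    int g = blockIdx.x * TILE + t;

    cur[t] = data[g];                      // 1. copy global -> shared
    __syncthreads();                       // 2. whole tile is loaded

    for (int it = 0; it < ITERS; ++it) {   // 3. iterate in shared memory
        float left  = cur[t > 0 ? t - 1 : t];
        float right = cur[t < TILE - 1 ? t + 1 : t];
        nxt[t] = (left + cur[t] + right) / 3.0f;
        __syncthreads();                   // all reads of cur are finished
        cur[t] = nxt[t];
        __syncthreads();                   // 4. all updates are visible
    }

    data[g] = cur[t];                      // 5. copy shared -> global
}

// CPU reference applying the same per-tile update.
static void smoothTilesCpu(float* data, int numTiles)
{
    float cur[TILE], nxt[TILE];
    for (int b = 0; b < numTiles; ++b) {
        for (int t = 0; t < TILE; ++t) cur[t] = data[b * TILE + t];
        for (int it = 0; it < ITERS; ++it) {
            for (int t = 0; t < TILE; ++t) {
                float left  = cur[t > 0 ? t - 1 : t];
                float right = cur[t < TILE - 1 ? t + 1 : t];
                nxt[t] = (left + cur[t] + right) / 3.0f;
            }
            for (int t = 0; t < TILE; ++t) cur[t] = nxt[t];
        }
        for (int t = 0; t < TILE; ++t) data[b * TILE + t] = cur[t];
    }
}

// Hypothetical helper: largest absolute difference between GPU and CPU.
float maxError()
{
    const int numTiles = 4;
    const int n = numTiles * TILE;
    const size_t bytes = n * sizeof(float);
    float h_in[n], h_ref[n], h_gpu[n];
    for (int i = 0; i < n; ++i) {
        h_in[i] = h_ref[i] = (float)(i % 7);
        h_gpu[i] = -1000.0f;               // sentinel in case a copy fails
    }

    float* d_data;
    cudaMalloc((void**)&d_data, bytes);
    cudaMemcpy(d_data, h_in, bytes, cudaMemcpyHostToDevice);
    smoothTiles<<<numTiles, TILE>>>(d_data);
    cudaMemcpy(h_gpu, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    smoothTilesCpu(h_ref, numTiles);
    float worst = 0.0f;
    for (int i = 0; i < n; ++i) worst = fmaxf(worst, fabsf(h_gpu[i] - h_ref[i]));
    return worst;
}

int main()
{
    printf("max |gpu - cpu| = %g\n", maxError());
    return 0;
}
```

Global memory is touched only twice per element (one read, one write); all ITERS iterations run out of on-chip shared memory.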
Execution Model
• Each thread block is executed by a single multiprocessor
  – Synchronized using shared memory
• Many thread blocks are assigned to a single multiprocessor
  – Executed concurrently in a time-sharing fashion
  – Keep GPU as busy as possible
• Running many threads in parallel can hide DRAM memory latency
  – Global memory access : 200~300 cycles
CUDA
• C-extension programming language
  – No graphics API
    • Flattens learning curve
    • Better performance
  – Supports debugging tools
• Extensions / API
  – Function type : __global__, __device__, __host__
  – Variable type : __shared__, __constant__
  – cudaMalloc(), cudaFree(), cudaMemcpy(),…
  – __syncthreads(), atomicAdd(),…
• Program types
  – Device program (kernel) : runs on the GPU
  – Host program : runs on the CPU to call device programs
Example: Vector Addition Kernel
// Pair-wise addition of vector elements
// One thread per addition
__global__ void
vectorAdd(float* iA, float* iB, float* oC)
{
    // global index of this thread across all blocks
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    oC[idx] = iA[idx] + iB[idx];
}
Courtesy NVIDIA
Example: Vector Addition Host Code
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
// … initialize h_A and h_B

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float) );
cudaMalloc( (void**) &d_B, N * sizeof(float) );
cudaMalloc( (void**) &d_C, N * sizeof(float) );

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
// (assumes N is a multiple of 256)
vectorAdd<<< N/256, 256 >>>( d_A, d_B, d_C );
Courtesy NVIDIA
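The kernel and host fragments above can be combined into one complete program. The sketch below is a hedged variant rather than the slide's exact code: it adds an element count parameter n with a bounds check and rounds the block count up, so that n need not be a multiple of 256, and it also copies the result back and frees memory. The helper vectorAddErrors() is a hypothetical name introduced for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Pair-wise addition, one thread per element.
// The extra parameter n and the bounds check are additions to the slide code.
__global__ void vectorAdd(const float* iA, const float* iB, float* oC, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n) oC[idx] = iA[idx] + iB[idx];
}

// Hypothetical helper: adds two n-element vectors on the GPU and returns
// the number of elements that differ from the expected sum.
int vectorAddErrors(int n)
{
    size_t bytes = n * sizeof(float);
    float* h_A = (float*) malloc(bytes);
    float* h_B = (float*) malloc(bytes);
    float* h_C = (float*) malloc(bytes);
    for (int i = 0; i < n; ++i) {
        h_A[i] = (float)i;
        h_B[i] = (float)(2 * i);
        h_C[i] = -1.0f;                 // sentinel in case a copy fails
    }

    // allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**) &d_A, bytes);
    cudaMalloc((void**) &d_B, bytes);
    cudaMalloc((void**) &d_C, bytes);

    // copy host memory to device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // round the block count up so every element gets a thread
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(d_A, d_B, d_C, n);

    // copy the result back (waits for the kernel to finish)
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h_C[i] != (float)(3 * i)) ++errors;

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return errors;
}

int main()
{
    int errors = vectorAddErrors(1000);   // 1000 is not a multiple of 256
    printf("mismatches: %d\n", errors);
    return errors != 0;
}
```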
Compiling CUDA
• nvcc
  – Compiler driver
  – Invokes cudacc, g++, cl
• PTX
  – Parallel Thread eXecution
[Figure: NVCC compiles a C/C++ CUDA application into CPU code and PTX code; a PTX-to-target compiler turns the PTX code into target code for G80 … GPU]
PTX code example:
ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
mad.f32 $f1, $f5, $f3, $f1;
Courtesy NVIDIA
Debugging
• Emulation mode
  – CUDA code can be compiled and run in emulation mode (nvcc -deviceemu)
  – No need for a device or driver
  – Each device thread is emulated with a host thread
  – Can call host functions from device code (e.g., printf)
  – Supports host debug functions (breakpoints, inspection, etc.)
• Hardware debug will be available late 2007
Optimization Tips
• Avoid shared memory bank conflicts
  – Shared memory space is split into 16 banks
    • Each bank is 4 bytes (32 bits) wide
    • Assigned in round-robin fashion
  – Any non-overlapped parallel bank access can be done by a single memory operation
• Coalesced global memory access
  – Contiguous memory addresses are fast
    • a = b[thread_id]; // coalesced
    • a = b[2*thread_id]; // non-coalesced
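The bank rule above (16 banks, 4 bytes wide, assigned round-robin) can be turned into a small calculation of how badly a strided shared memory access pattern conflicts. The sketch below runs entirely on the host and models only what the slide states. It additionally assumes that 16 threads (a half-warp) are served per shared memory request on G80, and the names bankOf and conflictDegree are hypothetical.

```cuda
#include <cstdio>

const int NUM_BANKS = 16;   // from the slide
const int HALF_WARP = 16;   // assumption: threads served per request on G80

// Bank that holds the 32-bit word at wordIndex (round-robin assignment).
__host__ __device__ inline int bankOf(int wordIndex)
{
    return wordIndex % NUM_BANKS;
}

// Worst-case number of threads that hit the same bank when thread t
// accesses the 32-bit word at index t * stride (distinct addresses).
// A result of 1 means the access is conflict-free.
int conflictDegree(int stride)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < HALF_WARP; ++t) {
        int b = bankOf(t * stride);
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}

int main()
{
    const int strides[] = {1, 2, 16, 17};
    for (int i = 0; i < 4; ++i)
        printf("stride %2d -> %2d-way access\n",
               strides[i], conflictDegree(strides[i]));
    return 0;
}
```

Under this model, stride 1 is conflict-free, stride 2 gives a 2-way conflict, and stride 16 (for example, walking down a column of a 16-word-wide tile) serializes all 16 threads on one bank. Padding each row to 17 words changes the column stride to 17, which is conflict-free again; this is one common way to avoid bank conflicts.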
CUDA Enabled GPUs / OS
• Supported OS
  – MS Windows, Linux
• Supported HW
  – NVIDIA GeForce 8800 series
  – NVIDIA Quadro 5600/4600
  – NVIDIA Tesla series
Courtesy NVIDIA
ATI CTM (Close To Metal)
• Similar to CUDA PTX
  – A set of native device instructions
• No compiler support
• Limited programming environment
Example: Fast Iterative Method
• CUDA implementation
  – Tile size : 4x4x4
  – Update active tile
    • Neighbor access
  – Manage active list
    • Parallel reduction
Coalesced Global Memory Access
• Reordering
  – Each tile is stored in global memory in contiguous memory space
[Figure: non-coalesced vs. coalesced tile layout]
Update Active Tile
• Compute new solution
  – Copy a tile and its neighbor pixels to shared memory
  – Avoid bank conflicts
[Figure: a tile (yellow) with its Left, Right, Top, and Bottom neighbors]
Manage Active List
• Active list
  – Simple 1D integer array of active tile indices
  – Need to know which tile is NOT converged
[Figure: active points and the corresponding active tiles on a 4x4 grid of tiles numbered 0 to 15; active list = {1,2,5,7,8,9,11,13,14}]
Parallel Reduction of Convergence in Tiles
[Figure: tile (1D view) convergence flags reduced level by level: T T T T F F F F, then T F F F, then F F, then F]
T = Converged, F = Not converged
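A per-tile convergence test of this kind can be written as a shared memory tree reduction with a logical AND. The talk does not show its implementation, so the sketch below is an assumption-laden illustration: one block of 64 threads per 4x4x4 tile, one 0/1 flag per point, and hypothetical names reduceConverged and checkTileReduction.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE_SIZE 64   // 4x4x4 points per tile, one thread per point

// One block per tile. A tile is converged only if all of its points are.
__global__ void reduceConverged(const int* pointConverged, int* tileConverged)
{
    __shared__ int flags[TILE_SIZE];
    int t = threadIdx.x;
    flags[t] = pointConverged[blockIdx.x * TILE_SIZE + t];
    __syncthreads();

    // Halve the number of active threads each step: 64 -> 32 -> ... -> 1.
    for (int stride = TILE_SIZE / 2; stride > 0; stride >>= 1) {
        if (t < stride) flags[t] = flags[t] && flags[t + stride];
        __syncthreads();
    }

    if (t == 0) tileConverged[blockIdx.x] = flags[0];
}

// Hypothetical helper: three tiles (all converged, one point not
// converged, nothing converged); returns true if the result is {1, 0, 0}.
bool checkTileReduction()
{
    const int numTiles = 3;
    const int n = numTiles * TILE_SIZE;
    int h_points[n];
    int h_tiles[numTiles] = {-1, -1, -1};   // sentinel in case a copy fails
    for (int i = 0; i < n; ++i) h_points[i] = (i < 2 * TILE_SIZE) ? 1 : 0;
    h_points[TILE_SIZE + 37] = 0;           // single unconverged point in tile 1

    int *d_points, *d_tiles;
    cudaMalloc((void**)&d_points, n * sizeof(int));
    cudaMalloc((void**)&d_tiles, numTiles * sizeof(int));
    cudaMemcpy(d_points, h_points, n * sizeof(int), cudaMemcpyHostToDevice);

    reduceConverged<<<numTiles, TILE_SIZE>>>(d_points, d_tiles);

    cudaMemcpy(h_tiles, d_tiles, numTiles * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_points);
    cudaFree(d_tiles);

    return h_tiles[0] == 1 && h_tiles[1] == 0 && h_tiles[2] == 0;
}

int main()
{
    printf("tile reduction %s\n", checkTileReduction() ? "ok" : "FAILED");
    return 0;
}
```

Tiles whose flag comes back 0 would stay on the active list; how the list itself is rebuilt is not covered by this sketch.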
Wrap up
• GPU computing is promising
  – Many scientific computing problems are parallelizable
  – More consistency/stability in HW/SW
    • Streaming architectures are here to stay (and more so)
    • Industry trend is multi/many core processors
  – Better support/tools (easier to learn, maintain)
• Issues
  – No industry-wide standard
  – Market driven by gaming industry
  – Not every problem is suitable for GPUs
  – Re-engineer algorithms/software
  – Future performance growth?
• Impact on the data-analysis/interpretation workflow
Questions?