
THE GPU COMPUTING ERA

John Nickolls and William J. Dally
NVIDIA

GPU COMPUTING IS AT A TIPPING POINT, BECOMING MORE WIDELY USED IN DEMANDING CONSUMER APPLICATIONS AND HIGH-PERFORMANCE COMPUTING. THIS ARTICLE DESCRIBES THE RAPID EVOLUTION OF GPU ARCHITECTURES FROM GRAPHICS PROCESSORS TO MASSIVELY PARALLEL MANY-CORE MULTIPROCESSORS, RECENT DEVELOPMENTS IN GPU COMPUTING ARCHITECTURES, AND HOW THE ENTHUSIASTIC ADOPTION OF CPU+GPU COPROCESSING IS ACCELERATING PARALLEL APPLICATIONS.

As we enter the era of GPU computing, demanding applications with substantial parallelism increasingly use the massively parallel computing capabilities of GPUs to achieve superior performance and efficiency. Today GPU computing enables applications that we previously thought infeasible because of long execution times.

With the GPU's rapid evolution from a configurable graphics processor to a programmable parallel processor, the ubiquitous GPU in every PC, laptop, desktop, and workstation is a many-core multithreaded multiprocessor that excels at both graphics and computing applications. Today's GPUs use hundreds of parallel processor cores executing tens of thousands of parallel threads to rapidly solve large problems having substantial inherent parallelism. They're now the most pervasive massively parallel processing platform ever available, as well as the most cost-effective.

Using NVIDIA GPUs as examples, this article describes the evolution of GPU computing and its parallel computing model, the enabling architecture and software developments, how computing applications use CPU+GPU coprocessing, example application performance speedups, and trends in GPU computing.

GPU computing's evolution

Why have GPUs evolved to have large numbers of parallel threads and many cores? The driving force continues to be the real-time graphics performance needed to render complex, high-resolution 3D scenes at interactive frame rates for games.

Rendering high-definition graphics scenes is a problem with tremendous inherent parallelism. A graphics programmer writes a single-thread program that draws one pixel, and the GPU runs multiple instances of this thread in parallel, drawing multiple pixels in parallel. Graphics programs, written in shading languages such as Cg or High-Level Shading Language (HLSL), thus scale transparently over a wide range of thread and processor parallelism. Also, GPU computing programs, written in C or C++ with the CUDA parallel computing model [1, 2] or using a parallel computing API inspired by CUDA such as DirectCompute [3] or OpenCL [4], scale transparently over a wide range of parallelism. Software scalability, too, has enabled GPUs to rapidly increase their parallelism and performance with increasing transistor density.

GPU technology development

The demand for faster and higher-definition graphics continues to drive the development of increasingly parallel GPUs.




Table 1 lists significant milestones in NVIDIA GPU technology development that drove the evolution of unified graphics and computing GPUs. GPU transistor counts increased exponentially, doubling roughly every 18 months with increasing semiconductor density. Since their 2006 introduction, CUDA parallel computing cores per GPU also doubled nearly every 18 months.

In the early 1990s, there were no GPUs. Video graphics array (VGA) controllers generated 2D graphics displays for PCs to accelerate graphical user interfaces. In 1997, NVIDIA released the RIVA 128, a 3D single-chip graphics accelerator for games and 3D visualization applications, programmed with Microsoft Direct3D and OpenGL. Evolving to modern GPUs involved adding programmability incrementally, from fixed-function pipelines to microcoded processors, configurable processors, programmable processors, and scalable parallel processors.

Early GPUs

The first GPU was the GeForce 256, a single-chip 3D real-time graphics processor introduced in 1999 that included nearly every feature of high-end workstation 3D graphics pipelines of that era. It contained a configurable 32-bit floating-point vertex transform and lighting processor, and a configurable integer pixel-fragment pipeline, programmed with OpenGL and Microsoft DirectX 7 (DX7) APIs.

GPUs first used floating-point arithmetic to calculate 3D geometry and vertices, then applied it to pixel lighting and color values to handle high-dynamic-range scenes and to simplify programming. They implemented accurate floating-point rounding to eliminate frame-varying artifacts on moving polygon edges that would otherwise sparkle at real-time frame rates.

As programmable shaders emerged, GPUs became more flexible and programmable. In 2001, the GeForce 3 introduced the first programmable vertex processor that executed vertex shader programs, along with a configurable 32-bit floating-point pixel-fragment pipeline, programmed with OpenGL and DX8. The ATI Radeon 9700, introduced in 2002, featured a programmable 24-bit floating-point pixel-fragment processor programmed with DX9 and OpenGL. The GeForce FX and GeForce 6800 [5] featured programmable 32-bit floating-point pixel-fragment processors and vertex processors, programmed with Cg programs, DX9, and OpenGL.

Table 1. NVIDIA GPU technology development.

| Date | Product          | Transistors | CUDA cores | Technology                                                                                                           |
|------|------------------|-------------|------------|-----------------------------------------------------------------------------------------------------------------------|
| 1997 | RIVA 128         | 3 million   |            | 3D graphics accelerator                                                                                                 |
| 1999 | GeForce 256      | 25 million  |            | First GPU, programmed with DX7 and OpenGL                                                                               |
| 2001 | GeForce 3        | 60 million  |            | First programmable shader GPU, programmed with DX8 and OpenGL                                                           |
| 2002 | GeForce FX       | 125 million |            | 32-bit floating-point (FP) programmable GPU with Cg programs, DX9, and OpenGL                                           |
| 2004 | GeForce 6800     | 222 million |            | 32-bit FP programmable scalable GPU, GPGPU Cg programs, DX9, and OpenGL                                                 |
| 2006 | GeForce 8800     | 681 million | 128        | First unified graphics and computing GPU, programmed in C with CUDA                                                     |
| 2007 | Tesla T8, C870   | 681 million | 128        | First GPU computing system programmed in C with CUDA                                                                    |
| 2008 | GeForce GTX 280  | 1.4 billion | 240        | Unified graphics and computing GPU, IEEE FP, CUDA C, OpenCL, and DirectCompute                                          |
| 2008 | Tesla T10, S1070 | 1.4 billion | 240        | GPU computing clusters, 64-bit IEEE FP, 4-Gbyte memory, CUDA C, and OpenCL                                              |
| 2009 | Fermi            | 3.0 billion | 512        | GPU computing architecture, IEEE 754-2008 FP, 64-bit unified addressing, caching, ECC memory, CUDA C, C++, OpenCL, and DirectCompute |


These processors were highly multithreaded, creating a thread and executing a thread program for each vertex and pixel fragment. The GeForce 6800 scalable processor core architecture facilitated multiple GPU implementations with different numbers of processor cores.

Developing the Cg language [6] for programming GPUs provided a scalable parallel programming model for the programmable floating-point vertex and pixel-fragment processors of the GeForce FX, GeForce 6800, and subsequent GPUs. A Cg program resembles a C program for a single thread that draws a single vertex or single pixel. The multithreaded GPU created independent threads that executed a shader program to draw every vertex and pixel fragment.

In addition to rendering real-time graphics, programmers also used Cg to compute physical simulations and other general-purpose GPU (GPGPU) computations. Early GPGPU computing programs achieved high performance, but were difficult to write because programmers had to express nongraphics computations with a graphics API such as OpenGL.

Unified computing and graphics GPUs

The GeForce 8800, introduced in 2006, featured the first unified graphics and computing GPU architecture [7, 8], programmable in C with the CUDA parallel computing model in addition to using DX10 and OpenGL. Its unified streaming processor cores executed vertex, geometry, and pixel shader threads for DX10 graphics programs, and also executed computing threads for CUDA C programs. Hardware multithreading enabled the GeForce 8800 to efficiently execute up to 12,288 threads concurrently in 128 processor cores. NVIDIA deployed the scalable architecture in a family of GeForce GPUs with different numbers of processor cores for each market segment.

The GeForce 8800 was the first GPU to use scalar thread processors rather than vector processors, matching standard scalar languages like C and eliminating the need to manage vector registers and program vector operations. It added instructions to support C and other general-purpose languages, including integer arithmetic, IEEE 754 floating-point arithmetic, and load/store memory access instructions with byte addressing. It provided hardware and instructions to support parallel computation, communication, and synchronization, including thread arrays, shared memory, and fast barrier synchronization.

GPU computing systems

At first, users built personal supercomputers by adding multiple GPU cards to PCs and workstations, and assembled clusters of GPU computing nodes. In 2007, responding to demand for GPU computing systems, NVIDIA introduced the Tesla C870, D870, and S870 GPU card, deskside, and rackmount GPU computing systems containing one, two, and four T8 GPUs. The T8 GPU was based on the GeForce 8800 GPU, configured for parallel computing.

The second-generation Tesla C1060 and S1070 GPU computing systems, introduced in 2008, used the T10 GPU, based on the GPU in the GeForce GTX 280. The T10 featured 240 processor cores, a 1-teraflop-per-second peak single-precision floating-point rate, IEEE 754-2008 double-precision 64-bit floating-point arithmetic, and 4-Gbyte DRAM memory. Today, Tesla S1070 systems with thousands of GPUs are widely deployed in high-performance computing systems in production and research.

NVIDIA introduced the third-generation Fermi GPU computing architecture in 2009 [9]. Based on user experience with prior generations, it addressed several key areas to make GPU computing more broadly applicable. Fermi implemented IEEE 754-2008 arithmetic and significantly increased double-precision performance. It added error-correcting code (ECC) memory protection for large-scale GPU computing, 64-bit unified addressing, a cached memory hierarchy, and instructions for C, C++, Fortran, OpenCL, and DirectCompute.

GPU computing ecosystem

The GPU computing ecosystem is expanding rapidly, enabled by the deployment of more than 180 million CUDA-capable GPUs.


Researchers and developers have enthusiastically adopted CUDA and GPU computing for a diverse range of applications [10], publishing hundreds of technical papers, writing parallel programming textbooks [11], and teaching CUDA programming at more than 300 universities. The CUDA Zone (see http://www.nvidia.com/object/cuda_home_new.html) lists more than 1,000 links to GPU computing applications, programs, and technical papers. The 2009 GPU Technology Conference (see http://www.nvidia.com/object/research_summit_posters.html) published 91 research posters.

Library and tools developers are making GPU development more productive. GPU computing languages include CUDA C, CUDA C++, Portland Group (PGI) CUDA Fortran, DirectCompute, and OpenCL. GPU mathematics packages include MathWorks Matlab, Wolfram Mathematica, National Instruments Labview, SciComp SciFinance, and PyCUDA. NVIDIA developed the parallel Nsight GPU development environment, debugger, and analyzer integrated with Microsoft Visual Studio. GPU libraries include C++ productivity libraries, dense linear algebra, sparse linear algebra, FFTs, video and image processing, and data-parallel primitives. Computer system manufacturers are developing integrated CPU+GPU coprocessing systems in rackmount server and cluster configurations.

CUDA scalable parallel architecture

CUDA is a hardware and software coprocessing architecture for parallel computing that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, OpenCL, DirectCompute, and other languages. Because most languages were designed for one sequential thread, CUDA preserves this model and extends it with a minimalist set of abstractions for expressing parallelism. This lets the programmer focus on the important issues of parallelism, namely how to design efficient parallel algorithms, using a familiar language.

By design, CUDA enables the development of highly scalable parallel programs that can run across tens of thousands of concurrent threads and hundreds of processor cores. A compiled CUDA program executes on any size GPU, automatically using more parallelism on GPUs with more processor cores and threads.

A CUDA program is organized into a host program, consisting of one or more sequential threads running on a host CPU, and one or more parallel kernels suitable for execution on a parallel computing GPU. A kernel executes a sequential program on a set of lightweight parallel threads. As Figure 1 shows, the programmer or compiler organizes these threads into a grid of thread blocks. The threads comprising a thread block can synchronize with each other via barriers and communicate via a high-speed, per-block shared memory.

Threads from different blocks in the same grid can coordinate via atomic operations in a global memory space shared by all threads. Sequentially dependent kernel grids can synchronize via global barriers and coordinate via global shared memory. CUDA requires that thread blocks be independent, which provides scalability to GPUs with different numbers of processor cores and threads.

Figure 1. The CUDA hierarchy of threads, thread blocks, and grids of blocks, with corresponding memory spaces: per-thread private local, per-block shared, and per-application global memory spaces.
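To make these abstractions concrete, the following sketch (not from the article; the kernel name, the 256-thread block size, and the use of atomicAdd are my own illustrative choices) sums an integer array. Threads in each block cooperate through per-block shared memory and barrier synchronization, and one thread per block then coordinates across blocks through an atomic add in global memory.

__global__ void blockSum(const int *in, int *total, int n)
{
    __shared__ int partial[256];                 // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    partial[threadIdx.x] = (i < n) ? in[i] : 0;  // assumes blockDim.x == 256
    __syncthreads();                             // barrier: all stores visible

    // Tree reduction within the thread block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // One thread per block coordinates across blocks via a global atomic.
    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);
}

A host program would launch it as blockSum<<<(n + 255)/256, 256>>>(d_in, d_total, n), with d_total initialized to zero.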


Thread blocks implement coarse-grained scalable data parallelism, while the lightweight threads comprising each thread block provide fine-grained data parallelism. Thread blocks executing different kernels implement coarse-grained task parallelism. Threads executing different paths implement fine-grained thread-level parallelism. Details of the CUDA programming model are available in the programming guide [2].

Figure 2 shows some basic features of parallel programming with CUDA. It contains sequential and parallel implementations of the SAXPY routine defined by the basic linear algebra subroutines (BLAS) library. Given scalar a and vectors x and y containing n floating-point numbers, it performs the update y = ax + y. The serial implementation is a simple loop that computes one element of y per iteration. The parallel kernel executes each of these independent iterations in parallel, assigning a separate thread to compute each element of y. The __global__ modifier indicates that the procedure is a kernel entry point, and the extended function-call syntax saxpy<<<B, T>>>(...) launches the kernel saxpy() in parallel across B blocks of T threads each. Each thread determines which element it should process from its integer thread block index blockIdx.x, its thread index within its block threadIdx.x, and the total number of threads per block blockDim.x.

This example demonstrates a common parallelization pattern, where we can transform a serial loop with independent iterations to execute in parallel across many threads. In the CUDA paradigm, the programmer writes a scalar program (the parallel saxpy() kernel) that specifies the behavior of a single thread of the kernel. This lets CUDA leverage the standard C language with only a few small additions, such as built-in thread and block index variables. The SAXPY kernel is also a simple example of data parallelism, where parallel threads each produce assigned result data elements.

GPU computing architecture

To address different market segments, GPU architectures scale the number of processor cores and memories to implement different products for each segment while using the same scalable architecture and software. NVIDIA's scalable GPU computing architecture varies the number of streaming multiprocessors to scale computing performance, and varies the number of DRAM memories to scale memory bandwidth and capacity.

Each multithreaded streaming multiprocessor provides sufficient threads, processor cores, and shared memory to execute one or more CUDA thread blocks. The parallel processor cores within a streaming multiprocessor execute instructions for parallel threads. Multiple streaming multiprocessors provide coarse-grained scalable data and task parallelism to execute multiple coarse-grained thread blocks (possibly running different kernels) in parallel. Multithreading and parallel-pipelined processor cores within each streaming multiprocessor implement fine-grained data and thread-level parallelism to execute hundreds of fine-grained threads in parallel.

void saxpy(uint n, float a, float *x, float *y)
{
    uint i;
    for (i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

void serial_sample()
{
    // Call serial SAXPY function
    saxpy(n, 2.0, x, y);
}

(a)

__global__ void saxpy(uint n, float a, float *x, float *y)
{
    uint i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

void parallel_sample()
{
    // Launch parallel SAXPY kernel
    // using n/256 blocks of 256 threads each
    saxpy<<<n/256, 256>>>(n, 2.0, x, y);
}

(b)

Figure 2. Serial (a) and parallel CUDA (b) SAXPY kernels computing y = ax + y.
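Figure 2 omits the host-side setup. The sketch below shows one conventional way the host might allocate GPU DRAM, copy the vectors over PCIe, launch the Figure 2b kernel, and copy the result back. It assumes the __global__ saxpy kernel of Figure 2b is declared in the same file; the wrapper name and the 256-thread block size are illustrative assumptions, and error checking is omitted.

#include <cuda_runtime.h>

void saxpy_host(unsigned int n, float a, const float *x, float *y)
{
    float *d_x, *d_y;
    size_t bytes = n * sizeof(float);

    // Allocate device (GPU DRAM) buffers and copy the inputs to the GPU.
    cudaMalloc((void**)&d_x, bytes);
    cudaMalloc((void**)&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    // Launch the Figure 2b kernel: enough 256-thread blocks to cover n.
    saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);

    // Copy the updated y vector back to host memory and release the buffers.
    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}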


Application programs using the CUDA model thus scale transparently to small and large GPUs with different numbers of streaming multiprocessors and processor cores.

Fermi computing architecture

To illustrate GPU computing architecture, Figure 3 shows the third-generation Fermi computing architecture configured with 16 streaming multiprocessors, each with 32 CUDA processor cores, for a total of 512 cores. The GigaThread work scheduler distributes CUDA thread blocks to streaming multiprocessors with available capacity, dynamically balancing the computing workload across the GPU, and running multiple kernel tasks in parallel when appropriate. The multithreaded streaming multiprocessors schedule and execute CUDA thread blocks and individual threads. Each streaming multiprocessor executes up to 1,536 concurrent threads to help cover long-latency loads from DRAM memory. As each thread block completes executing its kernel program and releases its streaming multiprocessor resources, the work scheduler assigns a new thread block to that streaming multiprocessor.

The PCIe host interface connects the GPU and its DRAM memory with the host CPU and system memory. CPU+GPU coprocessing and data transfers use the bidirectional PCIe interface. The streaming multiprocessor threads access system memory via the PCIe interface, and CPU threads access GPU DRAM memory via PCIe.

The GPU architecture balances its parallel computing power with parallel DRAM memory controllers designed for high memory bandwidth. The Fermi GPU in Figure 3 has six high-speed GDDR5 DRAM interfaces, each 64 bits wide. Its 40-bit addresses handle up to 1 Tbyte of address space for GPU DRAM and CPU system memory for large-scale computing.

Figure 3. Fermi GPU computing architecture with 512 CUDA processor cores organized as 16 streaming multiprocessors (SMs) sharing a common second-level (L2) cache, six 64-bit DRAM interfaces, and a host interface with the host CPU, system memory, and I/O devices. Each streaming multiprocessor has 32 CUDA cores.


Cached memory hierarchy

Fermi introduces a parallel cached memory hierarchy for load, store, and atomic memory accesses by general applications. Each streaming multiprocessor has a first-level (L1) data cache, and the streaming multiprocessors share a common 768-Kbyte unified second-level (L2) cache. The L2 cache connects with six 64-bit DRAM interfaces and the PCIe interface, which connects with the host CPU, system memory, and PCIe devices. It caches DRAM memory locations and system memory pages accessed via the PCIe interface. The unified L2 cache services load, store, atomic, and texture instruction requests from the streaming multiprocessors and requests from their L1 caches, and fills the streaming multiprocessor instruction caches and uniform data caches.

Fermi implements a 40-bit physical address space that accesses GPU DRAM, CPU system memory, and PCIe device addresses. It provides a 40-bit virtual address space to each application context and maps it to the physical address space with translation lookaside buffers and page tables.

ECC memory

Fermi introduces ECC memory protection to enhance data integrity in large-scale GPU computing systems. Fermi ECC corrects single-bit errors and detects double-bit errors in the DRAM memory, GPU L2 cache, L1 caches, and streaming multiprocessor registers. ECC lets us integrate thousands of GPUs in a system while maintaining a high mean time between failures (MTBF) for high-performance computing and supercomputing systems.

Streaming multiprocessor

The Fermi streaming multiprocessor introduces several architectural features that deliver higher performance, improve its programmability, and broaden its applicability. As Figure 4 shows, the streaming multiprocessor execution units include 32 CUDA processor cores, 16 load/store units, and four special function units (SFUs). It has a 64-Kbyte configurable shared memory/L1 cache, a 128-Kbyte register file, an instruction cache, and two multithreaded warp schedulers and instruction dispatch units.

Efficient multithreading

The streaming multiprocessor implements zero-overhead multithreading and thread scheduling for up to 1,536 concurrent threads. To efficiently manage and execute this many individual threads, the multiprocessor employs the single-instruction multiple-thread (SIMT) architecture introduced in the first unified computing GPU [7, 8]. The SIMT instruction logic creates, manages, schedules, and executes concurrent threads in groups of 32 parallel threads called warps. A CUDA thread block comprises one or more warps. Each Fermi streaming multiprocessor has two warp schedulers and two dispatch units that each select a warp and issue an instruction from the warp to 16 CUDA cores, 16 load/store units, or four SFUs. Because warps execute independently, the streaming multiprocessor can issue two warp instructions to appropriate sets of CUDA cores, load/store units, and SFUs.

To support C, C++, and standard single-thread programming languages, each streaming multiprocessor thread is independent, having its own private registers, condition codes and predicates, private per-thread memory and stack frame, instruction address, and thread execution state. The SIMT instructions control the execution of an individual thread, including arithmetic, memory access, and branching and control flow instructions. For efficiency, the SIMT multiprocessor issues an instruction to a warp of 32 independent parallel threads.

The streaming multiprocessor realizes full efficiency and performance when all threads of a warp take the same execution path. If threads of a warp diverge at a data-dependent conditional branch, execution serializes for each branch path taken, and when all paths complete, the threads converge to the same execution path. The Fermi streaming multiprocessor extends the flexibility of SIMT independent thread control flow with indirect branch and function-call instructions, and trap handling for exceptions and debuggers.
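As a small illustration (a hypothetical kernel, not from the article), the data-dependent branch below can split a warp: the streaming multiprocessor executes the taken and not-taken paths one after the other with inactive threads masked off, and the warp then reconverges.

__global__ void clampSqrt(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Threads of one warp may disagree here and temporarily diverge.
        if (in[i] > 0.0f)
            out[i] = sqrtf(in[i]);   // path taken by some threads
        else
            out[i] = 0.0f;           // path taken by the others
        // Both paths complete, and the warp's threads reconverge here.
    }
}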

Thread instructions

Parallel thread execution (PTX) instructions describe the execution of a single thread in a parallel CUDA program. The PTX instructions focus on scalar (rather than vector) operations to match standard scalar programming languages.


Fermi implements the PTX 2.0 instruction set architecture (ISA), which targets C, C++, Fortran, OpenCL, and DirectCompute programs. Instructions include

- 32-bit and 64-bit integer, addressing, and floating-point arithmetic;
- load, store, and atomic memory access;
- texture and multidimensional surface access;
- individual thread flow control with predicated instructions, branching, function calls, and indirect function calls for C++ virtual functions; and
- parallel barrier synchronization.

CUDA cores

Each pipelined CUDA core executes a scalar floating-point or integer instruction per clock for a thread. With 32 cores, the streaming multiprocessor can execute up to 32 arithmetic thread instructions per clock.

Figure 4. The Fermi streaming multiprocessor has 32 CUDA processor cores, 16 load/store (LD/ST) units, four special function units (SFUs), a 64-Kbyte configurable shared memory/L1 cache, a 128-Kbyte register file, an instruction cache, and two multithreaded warp schedulers and instruction dispatch units.


The integer unit implements 32-bit precision for scalar integer operations, including 32-bit multiply and multiply-add operations, and efficiently supports 64-bit integer operations. The Fermi integer unit adds bit-field insert and extract, bit reverse, and population count.

IEEE 754-2008 floating-point arithmetic

The Fermi CUDA core floating-point unit implements the IEEE 754-2008 floating-point arithmetic standard for 32-bit single-precision and 64-bit double-precision results, including fused multiply-add (FMA) instructions. FMA computes D = A × B + C with no loss of precision by retaining full precision in the intermediate product and addition, then rounding the final sum to form the result. Using FMA enables fast division and square-root operations with exactly rounded results.
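In CUDA C, the standard math functions fmaf() and fma() expose fused multiply-add to device code; the tiny helpers below are illustrative (the names are my own), and on Fermi-class hardware the compiler maps them to single FMA instructions.

// Single precision: d = a*b + c with one rounding of the exact product + sum.
__device__ float  fma_sp(float a, float b, float c)    { return fmaf(a, b, c); }

// Double precision: the same fused operation at 64-bit precision.
__device__ double fma_dp(double a, double b, double c) { return fma(a, b, c); }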

Fermi raises the throughput of 64-bit double-precision operations to half that of single-precision operations, a dramatic improvement over the T10 GPU. This performance level enables broader deployment of GPUs in high-performance computing. The floating-point instructions handle subnormal numbers at full speed in hardware, allowing small values to retain partial precision rather than flushing them to zero or calculating subnormal values in multicycle software exception handlers as most CPUs do.

The SFUs execute 32-bit floating-point instructions for fast approximations of reciprocal, reciprocal square root, sin, cos, exp, and log functions. The approximations are precise to better than 22 mantissa bits.
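CUDA exposes these approximations through fast-math intrinsics such as __sinf, __cosf, and __expf; the kernel below is a hypothetical sketch, and the exact SFU instruction each intrinsic compiles to is an assumption on my part.

__global__ void fastApprox(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // Fast approximate transcendentals, good to roughly 22 mantissa bits.
        y[i] = __expf(x[i]) * __sinf(x[i]) + __cosf(x[i]);
}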

Unified memory addressing and access

The streaming multiprocessor load/store units execute load, store, and atomic memory access instructions. A warp of 32 active threads presents 32 individual byte addresses, and the instruction accesses each memory address. The load/store units coalesce 32 individual thread accesses into a minimal number of memory block accesses.

Fermi implements a unified thread address space that accesses the three separate parallel memory spaces of Figure 1: the per-thread local, per-block shared, and global memory spaces. A unified load/store instruction can access any of the three memory spaces, steering the access to the correct memory, which enables general C and C++ pointer access anywhere. Fermi provides a terabyte 40-bit unified byte address space, and the load/store ISA supports 64-bit byte addressing for future growth. The ISA also provides 32-bit addressing instructions when the program can limit its accesses to the lower 4 Gbytes of address space.
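To illustrate what unified addressing buys the programmer, the hypothetical device code below passes both a global-memory pointer and a shared-memory pointer to the same ordinary C function; Fermi's generic load instructions steer each access to the correct space. The function and variable names are my own.

__device__ float first3(const float *p)     // works on a pointer to any space
{
    return p[0] + p[1] + p[2];
}

__global__ void unifiedDemo(const float *gdata, float *out)
{
    __shared__ float sdata[3];
    if (threadIdx.x < 3)
        sdata[threadIdx.x] = gdata[threadIdx.x];
    __syncthreads();
    if (threadIdx.x == 0)
        out[blockIdx.x] = first3(gdata)     // global memory access
                        + first3(sdata);    // shared memory access, same code
}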

Configurable shared memory and L1 cache

On-chip shared memory provides low-latency, high-bandwidth access to data shared by cooperating threads in the same CUDA thread block. Fast shared memory significantly boosts the performance of many applications having predictable, regular addressing patterns, while reducing DRAM memory traffic.

Fermi introduces a configurable-capacity L1 cache to aid unpredictable or irregular memory accesses, along with a configurable-capacity shared memory. Each streaming multiprocessor has 64 Kbytes of on-chip memory, configurable as 48 Kbytes of shared memory and 16 Kbytes of L1 cache, or as 16 Kbytes of shared memory and 48 Kbytes of L1 cache.
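The split is selected per kernel through the CUDA runtime. The following is a hedged sketch, assuming the cudaFuncSetCacheConfig call that accompanied Fermi-era CUDA toolkits; the kernel names are placeholders.

#include <cuda_runtime.h>

__global__ void irregularKernel(float *data) { /* pointer-chasing accesses */ }
__global__ void tiledKernel(float *data)     { /* shared-memory tiling     */ }

void configureOnChipMemory(void)
{
    // Prefer 48 Kbytes of L1 cache for a kernel with irregular accesses.
    cudaFuncSetCacheConfig(irregularKernel, cudaFuncCachePreferL1);

    // Prefer 48 Kbytes of shared memory for a kernel with regular, tiled accesses.
    cudaFuncSetCacheConfig(tiledKernel, cudaFuncCachePreferShared);
}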

CPU+GPU coprocessing

Heterogeneous CPU+GPU coprocessing systems evolved because the CPU and GPU have complementary attributes that allow applications to perform best using both types of processors. CUDA programs are coprocessing programs: serial portions execute on the CPU, while parallel portions execute on the GPU. Coprocessing optimizes total application performance.

With coprocessing, we use the right core for the right job. We use a CPU core (optimized for low latency on a single thread) for a code's serial portions, and we use GPU cores (optimized for aggregate throughput) for a code's parallel portions. This approach gives more performance per unit area or power than either CPU or GPU cores alone.

The comparison in Table 2 illustrates the advantage of CPU+GPU coprocessing using Amdahl's law.


The table compares the performance of four configurations:

- a system containing one latency-optimized (CPU) core,
- a system containing 500 throughput-optimized (GPU) cores,
- a system containing 10 CPU cores, and
- a coprocessing system that contains a single CPU core and 450 GPU cores.

Table 2 assumes that a CPU core is 5× faster than and 50× the area of a GPU core, numbers consistent with contemporary CPUs and GPUs. The coprocessing system devotes 10 percent of its area to the single CPU core and 90 percent of its area to the 450 GPU cores.
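The Table 2 entries follow from a few lines of Amdahl's-law arithmetic. The sketch below reflects one reading of the model (an assumption on my part, not code from the article): the program takes 200 time units on one CPU core, the serial fraction runs on a single core (5× slower if that core is a GPU core), and the parallel fraction divides evenly over all available cores. Its output reproduces the Table 2 entries up to rounding of the printed figures.

#include <stdio.h>

static void model(const char *config, double serial_frac,
                  int cpu_cores, int gpu_cores)
{
    const double total = 200.0;          /* baseline time on one CPU core */
    double serial = serial_frac * total;
    if (cpu_cores == 0)
        serial *= 5.0;                   /* serial code on a GPU core is 5x slower */
    double parallel = (1.0 - serial_frac) * total / (cpu_cores + gpu_cores);
    printf("  %-24s serial %7.2f  parallel %7.2f  total %7.2f\n",
           config, serial, parallel, serial + parallel);
}

int main(void)
{
    const double fracs[2] = { 0.005, 0.75 };
    const char *names[2]  = { "Parallel-intensive (0.5% serial)",
                              "Mostly sequential (75% serial)" };
    for (int p = 0; p < 2; ++p) {
        printf("%s\n", names[p]);
        model("1 CPU core",            fracs[p],  1,   0);
        model("500 GPU cores",         fracs[p],  0, 500);
        model("10 CPU cores",          fracs[p], 10,   0);
        model("1 CPU + 450 GPU cores", fracs[p],  1, 450);
    }
    return 0;
}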

Table 2 compares the four configurations on both a parallel-intensive program (with a serial fraction of only 0.5 percent) and on a mostly sequential program (with a serial fraction of 75 percent). The coprocessing architecture is the fastest on both programs. On the parallel-intensive program, the coprocessing architecture is slightly slower on the parallel portion than the pure GPU configuration (0.44 seconds versus 0.40 seconds) but more than makes up for this by running the tiny serial portion 5× faster (1 second versus 5 seconds). The heterogeneous architecture has an advantage over the pure throughput-optimized configuration here because serial performance is important even for mostly parallel codes.

On the mostly sequential program, the coprocessing configuration matches the performance of the multi-CPU configuration on the code's serial portion but runs the parallel portion 45× as fast, giving a slightly faster overall performance. Even on mostly sequential codes, it's more efficient to run the code's parallel portion on a throughput-optimized architecture.

The coprocessing architecture provides the best performance across a wide range of serial fractions because it uses the right core for each task. By using a latency-optimized CPU to run the code's serial fraction, it gives the best possible performance on the serial fraction, which is important even for mostly parallel codes. By using throughput-optimized cores to run the code's parallel portion, it gives near-optimal performance on the parallel fraction as well, which becomes increasingly important as codes become more parallel. It's wasteful to use large, inefficient latency-optimized cores to run parallel code segments.

Application performance

Many applications consist of a mixture of fundamentally serial control logic and inherently parallel computations. Furthermore, these parallel computations are frequently data-parallel in nature. This directly matches the CUDA coprocessing programming model, namely a sequential control thread capable of launching a series of parallel kernels.

Table 2. CPU+GPU coprocessing execution time, assuming that a CPU core is 5× faster than and 50× the area of a GPU core. Entries are processing times; each configuration's area is shown in the column head.

| Program type               | 1 CPU core (area 50) | 500 GPU cores (area 500) | 10 CPU cores (area 500) | 1 CPU core + 450 GPU cores (area 500) |
|----------------------------|----------------------|--------------------------|-------------------------|----------------------------------------|
| Parallel-intensive program |                      |                          |                         |                                        |
| 0.5% serial code           | 1.0                  | 5.0                      | 1.0                     | 1.00                                   |
| 99.5% parallelizable code  | 199.0                | 0.4                      | 19.9                    | 0.44                                   |
| Total                      | 200.0                | 5.4                      | 20.9                    | 1.44                                   |
| Mostly sequential program  |                      |                          |                         |                                        |
| 75% serial code            | 150.0                | 750.0                    | 150.0                   | 150.0                                  |
| 25% parallelizable code    | 50.0                 | 0.1                      | 5.0                     | 0.11                                   |
| Total                      | 200.0                | 750.1                    | 155.0                   | 150.11                                 |


The use of parallel kernels launched from a sequential program also makes it relatively easy to parallelize an application's individual components rather than rewrite the entire application.

Many applications, both academic research and industrial products, have been accelerated using CUDA to achieve significant parallel speedups [10]. Such applications fall into a variety of problem domains, including quantum chemistry, molecular dynamics, computational fluid dynamics, quantum physics, finite element analysis, astrophysics, bioinformatics, computer vision, video transcoding, speech recognition, computed tomography, and computational modeling.

Table 3 lists some representative applications along with the runtime speedups obtained for the whole application using CPU+GPU coprocessing over CPU alone, as measured by application developers [12-22]. The speedups using the GeForce 8800, Tesla T8, GeForce GTX 280, Tesla T10, and GeForce GTX 285 range from 9× to more than 130×, with the higher speedups reflecting applications where more of the work ran in parallel on the GPU. The lower speedups, while still quite attractive, represent applications that are limited by the code's CPU portion, by coprocessing overhead, or by divergence in the code's GPU fraction.

The speedups achieved on this diverse set of applications validate the programmability of GPUs, in addition to their performance. The breadth of applications that have been ported to CUDA (more than 1,000 on the CUDA Zone) demonstrates the range of the programming model and architecture. Applications with dense matrices, sparse matrices, and arbitrary pointer structures have all been successfully implemented in CUDA with impressive speedups. Similarly, applications with diverse control structures and significant data-dependent control, such as ray tracing, have achieved good performance in CUDA.

Many real-world applications (such as interactive ray tracing) are composed of many different algorithms, each with varying degrees of parallelism. OptiX, our interactive ray-tracing software developer's kit built on the CUDA architecture, provides a mechanism to control and schedule a wide variety of tasks on both the CPU and GPU. Some tasks are primarily serial and execute on the CPU, such as compilation, data structure management, and coordination with the operating system and user interaction.

Table 3. Representative CUDA application coprocessing speedups.

| Application                                                                    | Field                                 | Speedup |
|--------------------------------------------------------------------------------|---------------------------------------|---------|
| Two-electron repulsion integral [12]                                            | Quantum chemistry                     | 130     |
| Gromacs [13]                                                                    | Molecular dynamics                    | 137     |
| Lattice Boltzmann [14]                                                          | 3D computational fluid dynamics (CFD) | 100     |
| Euler solver [15]                                                               | 3D CFD                                | 16      |
| Lattice quantum chromodynamics [16]                                             | Quantum physics                       | 10      |
| Multigrid finite element method and partial differential equation solver [17]   | Finite element analysis               | 27      |
| N-body physics [18]                                                             | Astrophysics                          | 100     |
| Protein multiple sequence alignment [19]                                        | Bioinformatics                        | 36      |
| Image contour detection [20]                                                    | Computer vision                       | 130     |
| Portable media converter*                                                       | Consumer video                        | 20      |
| Large vocabulary speech recognition [21]                                        | Human interaction                     | 9       |
| Iterative image reconstruction [22]                                             | Computed tomography                   | 130     |
| Matlab accelerator**                                                            | Computational modeling                | 100     |

* Elemental Technologies, Badaboom media converter, 2009; http://badaboomit.com.
** Accelereyes, Jacket GPU engine for Matlab, 2009; http://www.accelereyes.com.


Other tasks, such as building an acceleration structure or updating animations, may run either on the CPU or the GPU depending on the choice of algorithms and the performance required. However, the real power of OptiX is the manner in which it manages parallel operations that are either regular or highly irregular in a fine-grained fashion, and this is executed entirely on the GPU.

To accomplish this, OptiX decomposes the algorithm into a series of states. Some of these states consist of uniform operations, such as generating rays according to a camera model or applying a tone-mapping operator. Other states consist of highly irregular computation, such as traversing an acceleration structure to locate candidate geometry, or intersecting a ray with that candidate geometry. These operations are highly irregular because they require an indefinite number of operations and can vary in cost by an order of magnitude or more between rays. A user-provided program that executes within an abstract ray-tracing machine controls the details of each operation.

To execute this state machine, each warp of parallel threads on the GPU selects a handful of rays that have matching state. Each warp executes in SIMT fashion, causing each ray to transition to a new state (which might not be the same for all rays). This process repeats until all rays complete their operations. OptiX can control both the prioritization of the states and the scope of the search for rays to target trade-offs in different GPU generations. Even if rays temporarily diverge to different states, they'll return to a small number of common states, usually traversal and intersection. This state-machine model gives the GPU an opportunity to reunite diverged threads with other threads executing the same code, which wouldn't occur in a straightforward SIMT execution model.

The net result is that CPU+GPU coprocessing enables fast, interactive ray tracing of complex scenes while you watch, an application that researchers previously considered too irregular for a GPU.

GPU computing is at the tipping point. Single-threaded processor performance is no longer scaling at historic rates. Thus, we must use parallelism for the increased performance required to deliver more value to users. A GPU that's optimized for throughput delivers parallel performance much more efficiently than a CPU that's optimized for latency. A heterogeneous coprocessing architecture that combines a single latency-optimized core (a CPU) with many throughput-optimized cores (a GPU) performs better than either alternative alone. This is because it uses the right processor for the right job: the CPU for serial sections and critical paths, and the GPU for the parallel sections.

Because they efficiently execute programs across a range of parallelism, heterogeneous CPU+GPU architectures are becoming pervasive. In high-performance computing, technical computing, and consumer media processing, CPU+GPU coprocessing has become the architecture of choice. We expect the rate of adoption of GPU computing in these beachhead areas to accelerate, and GPU computing to spread to broader application areas. In addition to accelerating existing applications, we expect GPU computing to enable new classes of applications, such as building compelling virtual worlds and performing universal speech translation.

As the GPU computing era progresses, GPU performance will continue to scale at Moore's law rates, about 50 percent per year, giving an order-of-magnitude increase in performance in five to six years.

At the same time, GPU architecture will evolve to further increase the span of applications that it can efficiently address. GPU cores will not become CPUs; they will continue to be optimized for throughput rather than latency. However, they will evolve to become more agile and better able to handle arbitrary control and data access patterns. GPUs and their programming systems will also evolve to make GPUs even easier to program, further accelerating their rate of adoption. MICRO

Acknowledgments

We thank Jen-Hsun Huang of NVIDIA for his Hot Chips 21 keynote [23] that inspired this article, and the entire NVIDIA team that brings GPU computing to market.


References

1. J. Nickolls et al., "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, 2008, pp. 40-53.
2. NVIDIA, NVIDIA CUDA Programming Guide, 2009; http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf.
3. C. Boyd, "DirectCompute: Capturing the Teraflop," Microsoft Personal Developers Conf., 2009; http://ecn.channel9.msdn.com/o9/pdc09/ppt/CL03.pptx.
4. Khronos, The OpenCL Specification, 2009; http://www.khronos.org/OpenCL.
5. J. Montrym and H. Moreton, "The GeForce 6800," IEEE Micro, vol. 25, no. 2, 2005, pp. 41-51.
6. W.R. Mark et al., "Cg: A System for Programming Graphics Hardware in a C-like Language," Proc. Special Interest Group on Computer Graphics (Siggraph), ACM Press, 2003, pp. 896-907.
7. E. Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, no. 2, 2008, pp. 39-55.
8. J. Nickolls and D. Kirk, "Graphics and Computing GPUs," Computer Organization and Design: The Hardware/Software Interface, D.A. Patterson and J.L. Hennessy, 4th ed., Morgan Kaufmann, 2009, pp. A2-A77.
9. NVIDIA, "Fermi: NVIDIA's Next Generation CUDA Compute Architecture," 2009; http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
10. M. Garland et al., "Parallel Computing Experiences with CUDA," IEEE Micro, vol. 28, no. 4, 2008, pp. 13-27.
11. D.B. Kirk and W.W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.
12. I.S. Ufimtsev and T.J. Martinez, "Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation," J. Chemical Theory and Computation, vol. 4, no. 2, 2008, pp. 222-231.
13. M.S. Friedrichs et al., "Accelerating Molecular Dynamic Simulation on Graphics Processing Units," J. Computational Chemistry, vol. 30, no. 6, 2009, pp. 864-872.
14. J. Tolke and M. Krafczyk, "TeraFLOP Computing on a Desktop PC with GPUs for 3D CFD," Int'l J. Computational Fluid Dynamics, vol. 22, no. 7, 2008, pp. 443-456.
15. T. Brandvik and G. Pullan, "Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware," Proc. 46th Am. Inst. of Aeronautics and Astronautics (AIAA) Aerospace Sciences Meeting, AIAA, 2008; http://www.aiaa.org/agenda.cfm?lumeetingid=1065&dateget=08-Jan-08#session8907.
16. M.A. Clark et al., "Solving Lattice QCD Systems of Equations Using Mixed Precision Solvers on GPUs," Computer Physics Comm., 2009; http://arxiv.org/abs/0911.3191v2.
17. D. Goddeke and R. Strzodka, "Performance and Accuracy of Hardware-Oriented Native-, Emulated-, and Mixed-Precision Solvers in FEM Simulations (Part 2: Double Precision GPUs)," Ergebnisberichte des Instituts fur Angewandte Mathematik [Reports on Findings of the Inst. for Applied Mathematics], no. 370, Dortmund Univ. of Technology, 2008; http://www.mathematik.uni-dortmund.de/~goeddeke/pubs/GTX280_mixedprecision.pdf.
18. R.G. Belleman, J. Bedorf, and S.P. Zwart, "High Performance Direct Gravitational N-body Simulations on Graphics Processing Units II: An Implementation in CUDA," New Astronomy, vol. 13, no. 2, 2008, pp. 103-112.
19. Y. Liu, B. Schmidt, and D.L. Maskell, "MSA-CUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA," Proc. 20th IEEE Int'l Conf. Application-Specific Systems, Architectures and Processors, IEEE CS Press, 2009, pp. 121-128.
20. B. Catanzaro et al., "Efficient, High-Quality Image Contour Detection," Proc. IEEE Int'l Conf. Computer Vision, IEEE CS Press, 2009; http://www.cs.berkeley.edu/~catanzar/Damascene/iccv2009.pdf.
21. J. Chong et al., "Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors," tech. report UCB/EECS-2008-69, Univ. of California at Berkeley, 2008; http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-69.pdf.
22. Y. Pan et al., "Feasibility of GPU-Assisted Iterative Image Reconstruction for Mobile C-Arm CT," Proc. Int'l Soc. for Photonics and Optics (SPIE), vol. 7258, SPIE, 2009; http://www.sci.utah.edu/~ypan/Pan_SPIE2009.pdf.


23. J.H. Huang, "2009: The GPU Computing Tipping Point," Proc. IEEE Hot Chips 21, 2009; http://www.hotchips.org/archives/hc21.

John Nickolls is the director of architecture at NVIDIA for GPU computing. His technical interests include developing parallel processing products, languages, and architectures. Nickolls has a PhD in electrical engineering from Stanford University.

William J. Dally is the chief scientist and senior vice president of research at NVIDIA and the Willard R. and Inez Kerr Bell Professor of Engineering at Stanford University. His technical interests include parallel computer architecture, parallel programming systems, and interconnection networks. He's a member of the National Academy of Engineering and a fellow of IEEE, ACM, and the American Academy of Arts and Sciences. Dally has a PhD in computer science from the California Institute of Technology.

Direct questions and comments about this article to John Nickolls, NVIDIA, 2701 San Tomas Expressway, Santa Clara, CA 95050; [email protected].
