Transcript of "An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm"
Martin Burtscher, Department of Computer Science, Texas State University-San Marcos
(48 slides)

  • An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm

    Martin Burtscher

    Department of Computer Science

    Texas State University-San Marcos

  • Mapping Regular Code to GPUs

    (image: LLNL)

    Regular codes:

    Operate on array- and matrix-based data structures

    Exhibit mostly strided memory access patterns

    Have relatively predictable control flow (control flow behavior is largely determined by input size)

    Many regular codes are easy to port to GPUs, e.g., matrix codes executing many ops/word:

    Dense matrix operations (level 2 and 3 BLAS)

    Stencil codes (PDE solvers)

  • Mapping Irregular Code to GPUs

    (image: FSU)

    Irregular codes:

    Build, traverse, and update dynamic data structures (e.g., trees, graphs, linked lists, and priority queues)

    Exhibit data-dependent memory access patterns

    Have complex control flow (control flow behavior depends on input values and changes dynamically)

    Many important scientific programs are irregular

    E.g., data clustering, SAT solving, social networks, meshing, …

    Need case studies to learn how to best map irregular codes

  • Example: N-Body Simulation

    Irregular Barnes Hut algorithm

    Repeatedly builds an unbalanced tree & performs complex traversals on it

    Our implementation

    Designed for GPUs (not just a port of CPU code)

    First GPU implementation of the entire BH algorithm

    Results

    GPU is 21 times faster than CPU (6 cores) on this code

    GPU has better architecture for this irregular algorithm

    NVIDIA GPU Gems book chapter (2011)

  • Outline

    Introduction

    Barnes Hut algorithm

    CUDA implementation

    Experimental results

    Conclusions

    (image: NASA/JPL-Caltech/SSC)

  • N-Body Simulation

    (images: RUG, Cornell)

    Time evolution of a physical system

    The system consists of bodies; "n" is the number of bodies

    Bodies interact via pair-wise forces

    Many systems can be modeled in this way:

    Star/galaxy clusters (gravitational force)

    Particles (electric force, magnetic force)

  • Simple O(n^2) n-Body Algorithm

    Algorithm:

    Initialize body masses, positions, and velocities
    Iterate over time steps {
      Accumulate forces acting on each body
      Update body positions and velocities based on force
    }
    Output result

    More sophisticated n-body algorithms exist:

    Barnes Hut algorithm

    Fast Multipole Method (FMM)
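The O(n^2) step above can be sketched on the CPU as follows; this is a minimal Python sketch (not the talk's CUDA code), and the softening term `eps` and gravitational constant `G` are added assumptions to keep the example self-contained:

```python
def nbody_step(pos, vel, mass, dt, G=1.0, eps=1e-3):
    """One O(n^2) timestep: accumulate pairwise forces, then integrate."""
    n = len(pos)
    acc = [(0.0, 0.0)] * n
    for i in range(n):
        ax = ay = 0.0
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            # Softened inverse-cube distance avoids division by zero.
            inv_r3 = (dx * dx + dy * dy + eps * eps) ** -1.5
            ax += G * mass[j] * dx * inv_r3
            ay += G * mass[j] * dy * inv_r3
        acc[i] = (ax, ay)
    for i in range(n):
        vel[i] = (vel[i][0] + acc[i][0] * dt, vel[i][1] + acc[i][1] * dt)
        pos[i] = (pos[i][0] + vel[i][0] * dt, pos[i][1] + vel[i][1] * dt)
    return pos, vel
```

Every body visits every other body, which is exactly the O(n^2) pair count the next slide sets out to avoid.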

  • Barnes Hut Idea

    Precise force calculation

    Requires O(n^2) operations (O(n^2) body pairs)

    Computationally intractable for large n

    Barnes and Hut (1986)

    Algorithm to approximately compute forces

    Bodies' initial position and velocity are also approximate

    Requires only O(n log n) operations

    Idea is to "combine" far-away bodies

    Error should be small because force is proportional to 1/distance^2
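The 1/distance^2 argument can be checked numerically: replacing a distant, tight cluster by a single point at its center of mass changes the force only slightly. A small Python illustration (the positions and masses are made up for the example):

```python
def force_x(m, dx, dy, G=1.0):
    """x-component of gravitational acceleration from a point mass."""
    r2 = dx * dx + dy * dy
    return G * m * dx * r2 ** -1.5

# A tight cluster of three unit-mass bodies, far away from the origin.
cluster = [(100.0, 0.5), (100.5, -0.5), (99.5, 0.0)]
exact = sum(force_x(1.0, x, y) for x, y in cluster)

# Approximate the cluster by its total mass at its center of mass.
cx = sum(x for x, _ in cluster) / 3
cy = sum(y for _, y in cluster) / 3
approx = force_x(3.0, cx, cy)

rel_err = abs(exact - approx) / abs(exact)  # well under 1% at this distance
```

The cluster's extent (~1) is tiny compared to its distance (~100), so the relative error is on the order of (extent/distance)^2.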

  • Barnes Hut Algorithm

    Set bodies' initial position and velocity

    Iterate over time steps:

    1. Compute bounding box around bodies

    2. Subdivide space until at most one body per cell; record this spatial hierarchy in an octree

    3. Compute mass and center of mass of each cell

    4. Compute force on bodies by traversing octree; stop traversal path when encountering a leaf (body) or an internal node (cell) that is far enough away

    5. Update each body's position and velocity

  • Build Tree (Level 1)

    Compute bounding box around all bodies → tree root

    [figure: bodies in 2D space enclosed by the root bounding box; tree with a single root node]

  • Build Tree (Level 2)

    Subdivide space until at most one body per cell

    [figure: first subdivision into four cells; root gains four children]

  • Build Tree (Level 3)

    Subdivide space until at most one body per cell

    [figure: second subdivision level; tree grows to three levels]

  • Build Tree (Level 4)

    Subdivide space until at most one body per cell

    [figure: third subdivision level; tree grows to four levels]

  • Build Tree (Level 5)

    Subdivide space until at most one body per cell

    [figure: final subdivision level; every cell now holds at most one body]

  • Compute Cells' Center of Mass

    For each internal cell, compute sum of mass and weighted average of position of all bodies in subtree; example shows two cells only

    [figure: quadtree with two highlighted cells and their centers of mass]

  • Compute Forces

    Compute force, for example, acting upon green body

    [figure: quadtree with the green body highlighted]

  • Compute Force (short distance)

    Scan tree depth first from left to right; green portion already completed

    [figure: partially traversed tree; completed portion marked green]

  • Compute Force (down one level)

    Red center of mass is too close, need to go down one level

    [figure: traversal descending past the red cell's center of mass]

  • Compute Force (long distance)

    Blue center of mass is far enough away

    [figure: blue cell's center of mass used directly in the force computation]

  • Compute Force (skip subtree)

    Therefore, entire subtree rooted in the blue cell can be skipped

    [figure: traversal skipping the whole subtree below the blue cell]

  • Pseudocode

    bodySet = ...
    foreach timestep do {
      bounding_box = new Bounding_Box();
      foreach Body b in bodySet {
        bounding_box.include(b);
      }
      octree = new Octree(bounding_box);
      foreach Body b in bodySet {
        octree.Insert(b);
      }
      cellList = octree.CellsByLevel();
      foreach Cell c in cellList {
        c.Summarize();
      }
      foreach Body b in bodySet {
        b.ComputeForce(octree);
      }
      foreach Body b in bodySet {
        b.Advance();
      }
    }

  • Complexity and Parallelism

    bodySet = ...
    foreach timestep do {              // O(n log n) + fully ordered sequential
      bounding_box = new Bounding_Box();
      foreach Body b in bodySet {      // O(n) parallel reduction
        bounding_box.include(b);
      }
      octree = new Octree(bounding_box);
      foreach Body b in bodySet {      // O(n log n) top-down tree building
        octree.Insert(b);
      }
      cellList = octree.CellsByLevel();
      foreach Cell c in cellList {     // O(n) + partially ordered bottom-up traversal
        c.Summarize();
      }
      foreach Body b in bodySet {      // O(n log n) fully parallel
        b.ComputeForce(octree);
      }
      foreach Body b in bodySet {      // O(n) fully parallel
        b.Advance();
      }
    }

  • Outline

    Introduction

    Barnes Hut algorithm

    CUDA implementation

    Experimental results

    Conclusions

  • Efficient GPU Code

    Coalesced main memory accesses

    Little thread divergence

    Enough threads per block

    Not too many registers per thread

    Not too much shared memory usage

    Enough (independent) blocks

    Little synchronization between blocks

    Little CPU/GPU data transfer

    Efficient use of shared memory

    (image: Thepcreport.net)

  • Main BH Implementation Challenges

    Uses irregular tree-based data structure

    Load imbalance

    Little coalescing

    Complex recursive traversals

    Recursion not allowed*

    Lots of thread divergence

    Memory-bound pointer-chasing operations

    Not enough computation to hide latency

  • Six GPU Kernels

    Read initial data and transfer to GPU

    for each timestep do {
      1. Compute bounding box around bodies
      2. Build hierarchical decomposition, i.e., octree
      3. Summarize body information in internal octree nodes
      4. Approximately sort bodies by spatial location (optional)
      5. Compute forces acting on each body with help of octree
      6. Update body positions and velocities
    }

    Transfer result from GPU and output

  • Global Optimizations

    Make code iterative (recursion not supported*)

    Keep data on GPU between kernel calls

    Use array elements instead of heap nodes (objects on heap → objects in array → fields in arrays)

    One aligned array per field for coalesced accesses
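The "fields in arrays" step is the classic array-of-structures to structure-of-arrays layout change. A CPU-side Python sketch of the idea (the field names are illustrative, not the talk's actual data layout):

```python
# Array-of-structures: one record per body. When each GPU thread reads the
# same field of its own body, neighboring threads touch scattered addresses.
aos = [{"mass": 1.0, "posx": 0.0, "posy": 0.0},
       {"mass": 2.0, "posx": 1.0, "posy": 3.0}]

# Structure-of-arrays: one contiguous array per field. Thread i reads element
# i, so a warp's accesses to `mass` are adjacent (coalesced on a GPU).
soa = {
    "mass": [b["mass"] for b in aos],
    "posx": [b["posx"] for b in aos],
    "posy": [b["posy"] for b in aos],
}

total_mass = sum(soa["mass"])  # field-wise operations scan one array
```

In Python the difference is only organizational; on a GPU the SoA form is what makes per-field loads coalesce into one memory transaction per warp.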

  • Global Optimizations (cont.)

    Maximize thread count (round down to warp size)

    Maximize resident block count (all SMs filled)

    Pass kernel parameters through constant memory

    Use special allocation order

    Alias arrays (56 B/node)

    Use index arithmetic

    Persistent blocks & threads

    Unroll loops over children

    [figure: array layout — bodies b0…ba fixed at the front, cells c5…c0 allocated from the back toward the front]

  • Kernel 1: Bounding Box (Regular)

    Reduction operation

    [figure: hierarchical reduction — warps load from main memory into shared memory and combine partial results across barriers, halving the number of active warps each round]

    Optimizations:

    Fully coalesced

    Fully cached

    No bank conflicts

    Minimal divergence

    Built-in min and max

    2 red/mem, 6 red/bar

    Bodies load balanced

    512*3 threads per SM

    Reduction operation
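Kernel 1 computes the box with a parallel reduction; below is a sequential Python analogue that combines partial min/max boxes pairwise, mirroring the tree-shaped combining (a sketch only — the real kernel reduces in shared memory across warps and barriers):

```python
def bounding_box(pos):
    """Pairwise min/max reduction over 2D body positions.

    Each body starts as its own degenerate box (minx, miny, maxx, maxy);
    boxes are merged two at a time until one box remains, like the
    log-depth parallel reduction in the kernel.
    """
    boxes = [(x, y, x, y) for x, y in pos]  # one partial box per body
    while len(boxes) > 1:
        merged = []
        for i in range(0, len(boxes) - 1, 2):
            a, b = boxes[i], boxes[i + 1]
            merged.append((min(a[0], b[0]), min(a[1], b[1]),
                           max(a[2], b[2]), max(a[3], b[3])))
        if len(boxes) % 2:            # odd element carries over unmerged
            merged.append(boxes[-1])
        boxes = merged
    return boxes[0]
```

The merge operator is associative, which is what allows the GPU to apply it in any tree order across threads.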

  • *

    Kernel 2: Build Octree (Irregular) Optimizations

    Only lock leaf “pointers”

    Lock-free fast path

    Light-weight lock release on slow path

    No re-traverse after lock acquire failure

    Combined memory fence per block

    Re-compute position during traversal

    Separate init kernels for coalescing

    512*3 threads per SM

    An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm 30

    Top-down tree building

  • Kernel 2: Build Octree (cont.)

    // initialize
    cell = find_insertion_point(body);   // no locks, cache cell
    child = get_insertion_index(cell, body);
    if (child != locked) {               // skip atomic if already locked
      if (child == null) {               // fast path (frequent)
        if (null == atomicCAS(&cell[child], null, body)) {  // lock-free insertion
          // move on to next body
        }
      } else {                           // slow path (first part)
        if (child == atomicCAS(&cell[child], child, lock)) {  // acquire lock
          // build subtree with new and existing body
          flag = true;
        }
      }
    }
    __syncthreads();                     // barrier
    if (threadIdx == 0) __threadfence(); // push data out of L1 cache
    __syncthreads();                     // barrier
    if (flag) {                          // slow path (second part)
      cell[child] = new_subtree;         // insert subtree and release lock
      // move on to next body
    }
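The insertion logic of Kernel 2 can be mimicked sequentially. The sketch below keeps the fast path (empty child slot: just store the body) and the slow path (slot already holds a body: split the cell until the two bodies separate), but it uses plain stores instead of atomicCAS, a 2D quadtree instead of an octree, and assumes distinct body positions — a sketch of the structure, not the talk's kernel:

```python
class Cell:
    """A quadtree cell: center (cx, cy), half-width, and four child slots.
    A slot holds None, a body (x, y) tuple, or a child Cell."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half
        self.child = [None] * 4

    def quadrant(self, x, y):
        return (1 if x >= self.cx else 0) | (2 if y >= self.cy else 0)

def insert(cell, body):
    x, y = body
    while True:
        q = cell.quadrant(x, y)
        slot = cell.child[q]
        if slot is None:               # fast path: empty slot, just store
            cell.child[q] = body
            return
        if isinstance(slot, Cell):     # descend into an existing subcell
            cell = slot
            continue
        # Slow path: slot holds another body -> split into a new subcell.
        h = cell.half / 2
        sub = Cell(cell.cx + (h if q & 1 else -h),
                   cell.cy + (h if q & 2 else -h), h)
        cell.child[q] = sub
        insert(sub, slot)              # re-insert the displaced body
        cell = sub                     # then keep descending with `body`
```

On the GPU the split must additionally be published with a fence before the lock (the child pointer) is released, which is exactly the two-part slow path on the slide above.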

  • 4

    3 1 2

    3 4

    allocation direction

    3 4 1 2 3 4

    scan direction

    . . .

    Kernel 3: Summarize Subtrees (Irregular)

    Bottom-up tree traversal

    An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm 32

    Optimizations Scan avoids deadlock

    Use mass as flag + fence No locks, no atomics

    Use wait-free first pass

    Cache the ready information

    Piggyback on traversal Count bodies in subtrees

    No parent “pointers”

    128*6 threads per SM
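Kernel 3 computes each cell's total mass and center of mass from its children. A recursive Python sketch of that bottom-up pass (the GPU version is iterative, scans the cell array instead of recursing, and uses the mass field itself as the "ready" flag; the `Node` layout here is an assumption for illustration):

```python
class Node:
    """A cell whose children are bodies (x, y, mass) or other Nodes."""
    def __init__(self, children):
        self.children = children
        self.mass = self.comx = self.comy = 0.0

def summarize(node):
    """Set node.mass to the subtree's total mass and (comx, comy) to the
    mass-weighted average position of all bodies in the subtree."""
    m = sx = sy = 0.0
    for c in node.children:
        if isinstance(c, Node):
            summarize(c)          # children must be done first (bottom-up)
            x, y, cm = c.comx, c.comy, c.mass
        else:
            x, y, cm = c          # leaf: a body tuple
        m += cm
        sx += x * cm
        sy += y * cm
    node.mass = m
    node.comx, node.comy = sx / m, sy / m
```

The ordering constraint — a parent may only be summarized once all of its children are — is what the slide's "scan avoids deadlock" and mass-as-flag tricks enforce without locks.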

  • 4

    3 1 2

    3 4

    allocation direction

    3 4 1 2 3 4

    scan direction

    . . .

    Kernel 4: Sort Bodies (Irregular)

    Top-down tree traversal

    An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm 33

    Optimizations (Similar to Kernel 3) Scan avoids deadlock Use data field as flag

    No locks, no atomics

    Use counts from Kernel 3 Piggyback on traversal

    Move nulls to back

    Throttle flag polling requests with optional barrier

    64*6 threads per SM
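The "approximate sort" of Kernel 4 effectively orders bodies by their position in the tree, so bodies sharing a small cell become neighbors in the array. A Python sketch of that effect, using nested lists as a stand-in for the octree (the tree contents here are made up for the example):

```python
def tree_order(node, out=None):
    """Collect bodies in depth-first (tree) order: bodies that share small
    cells end up adjacent in the output, which approximates a spatial sort."""
    if out is None:
        out = []
    for c in node:
        if isinstance(c, list):       # child cell: recurse
            tree_order(c, out)
        elif c is not None:           # body: append in traversal order
            out.append(c)
    return out

# Hypothetical quadtree: two tight pairs of bodies in different quadrants.
tree = [[(0.1, 0.1), (0.2, 0.2)], None, None, [(9.0, 9.0), (9.1, 9.1)]]
order = tree_order(tree)              # each pair stays contiguous
```

Spatially adjacent bodies then land in the same warp in Kernel 5, which is what shrinks each warp's prefix union.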

  • Kernel 5: Force Calculation (Irregular)

    Multiple prefix traversals

    Optimizations:

    Group similar work together

    Uses sorting to minimize size of prefix union in each warp

    Early out (nulls in back)

    Traverse whole union to avoid divergence (warp voting)

    Lane 0 controls iteration stack for entire warp (fits in shared mem)

    Avoid unneeded volatile accesses

    Cache tree-level-based data

    256*5 threads per SM
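Kernel 5's traversal is written with an explicit stack, since the kernel cannot recurse. A single-body Python sketch with the standard size/distance opening criterion (the `theta` threshold, the softening `eps`, and the dict-based node layout are assumptions for illustration; self-interaction handling and the per-warp union traversal are omitted):

```python
def accel(body, root, theta=0.5, G=1.0, eps=1e-3):
    """Barnes-Hut acceleration on one body via an iterative stack traversal.
    Nodes are dicts {'x','y','m','size','children'}; ('x','y') is the center
    of mass, 'size' the cell width, and a leaf has children == []."""
    ax = ay = 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        dx = node['x'] - body[0]
        dy = node['y'] - body[1]
        r2 = dx * dx + dy * dy + eps * eps
        # Accept the node if it is a leaf or far enough away (size/dist < theta).
        if not node['children'] or node['size'] ** 2 < theta * theta * r2:
            inv_r3 = r2 ** -1.5
            ax += G * node['m'] * dx * inv_r3
            ay += G * node['m'] * dy * inv_r3
        else:
            stack.extend(node['children'])   # too close: open the cell
    return ax, ay
```

In the kernel, lane 0 keeps this stack in shared memory for the whole warp, and a warp vote decides whether *any* lane needs to open a cell, so all 32 lanes stay in lockstep.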

  • Architectural Support

    Coalesced memory accesses & lockstep execution

    All threads in a warp read the same tree node at the same time

    Only one memory access per warp instead of 32 accesses

    Warp-based execution

    Enables data sharing within warps without synchronization

    RSQRTF instruction

    Quickly computes a good approximation of 1/sqrtf(x)

    Warp voting instructions

    Quickly perform reduction operations within a warp
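RSQRTF matters because the interaction needs r^(-3), which is cheapest as one reciprocal square root that is then cubed — no division, no pow(). A Python sketch of that formulation (the softening term `eps` is an added assumption; Python's `1/math.sqrt` stands in for the GPU's approximate RSQRTF):

```python
import math

def accel_contribution(m, dx, dy, dz, G=1.0, eps=1e-3):
    """One body-body interaction, structured the way the kernel computes it:
    a single reciprocal square root (RSQRTF(r2) on the GPU), then reuse."""
    r2 = dx * dx + dy * dy + dz * dz + eps * eps
    inv_r = 1.0 / math.sqrt(r2)       # one rsqrt instead of sqrt + divide
    inv_r3 = inv_r * inv_r * inv_r    # r^(-3) by multiplication only
    return G * m * dx * inv_r3, G * m * dy * inv_r3, G * m * dz * inv_r3
```

On the GPU, RSQRTF trades a few bits of accuracy for a large throughput gain, which the speedup table later in the talk quantifies at roughly 1.4x for this kernel.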

  • Kernel 6: Advance Bodies (Regular)

    Straightforward streaming

    [figure: warps stream body data from main memory and write results back]

    Optimizations:

    Fully coalesced, no divergence

    Load balanced, 1024*1 threads per SM

  • Outline

    Introduction

    Barnes Hut algorithm

    CUDA implementation

    Experimental results

    Conclusions

  • Evaluation Methodology

    Implementations:

    CUDA: irregular Barnes Hut & regular O(n^2) algorithm

    OpenMP: Barnes Hut algorithm (derived from CUDA code)

    Pthreads: Barnes Hut algorithm (from SPLASH-2 suite)

    Systems and compilers:

    nvcc 4.0 (-O3 -arch=sm_20 -ftz=true*)

    GeForce GTX 480, 1.4 GHz, 15 SMs, 32 cores per SM

    gcc 4.1.2 (-O3 -fopenmp* -ffast-math*)

    Xeon X5690, 3.46 GHz, 6 cores, 2 threads per core

    Inputs and metric:

    5k, 50k, 500k, and 5M star clusters (Plummer model)

    Best runtime of three experiments, excluding I/O

  • Nodes Touched per Activity (5M Input)

    Kernel "activities":

    K1: pair reduction
    K2: tree insertion
    K3: bottom-up step
    K4: top-down step
    K5: prefix traversal
    K6: integration step

    Neighborhood size (nodes touched per activity):

                min      avg      max
    kernel 1      1      2.0        2
    kernel 2      2     13.2       22
    kernel 3      2      4.1        9
    kernel 4      2      4.1        9
    kernel 5    818  4,117.0    6,315
    kernel 6      1      1.0        1

    Max tree depth ≤ 22; cells have 3.1 children on average

    Prefix ≤ 6,315 nodes (≤ 0.1% of 7.4M)

    BH algorithm & sorting to minimize size of prefix union work well

  • 1

    10

    100

    1,000

    10,000

    100,000

    1,000,000

    10,000,000k1

    .1

    k1.2

    k1.3

    k1.4

    k1.5

    k1.6

    k1.7

    k1.8

    k1.9

    k1

    .10

    k1

    .11

    k1

    .12

    k1

    .13

    k1

    .14

    k1

    .15

    k1

    .16

    k1

    .17

    k1

    .18

    k1

    .19

    k1

    .20

    k1

    .21

    k1

    .22

    k1

    .23

    k2.1

    k2.2

    k2.3

    k2.4

    k2.5

    k2.6

    k2.7

    k2.8

    k2.9

    k2

    .10

    k2

    .11

    k2

    .12

    k2

    .13

    k2

    .14

    k2

    .15

    k2

    .16

    k2

    .17

    k2

    .18

    k2

    .19

    k3.1

    k3.2

    k3.3

    k3.4

    k3.5

    k3.6

    k3.7

    k3.8

    k3.9

    k3

    .10

    k3

    .11

    k3

    .12

    k3

    .13

    k3

    .14

    k3

    .15

    k3

    .16

    k3

    .17

    k3

    .18

    k3

    .19

    k3

    .20

    k4.1

    k4.2

    k4.3

    k4.4

    k4.5

    k4.6

    k4.7

    k4.8

    k4.9

    k4

    .10

    k4

    .11

    k4

    .12

    k4

    .13

    k4

    .14

    k4

    .15

    k4

    .16

    k4

    .17

    k4

    .18

    k4

    .19

    k4

    .20

    k5.1

    k6.1

    avai

    lab

    le p

    aral

    lelis

    m [

    ind

    ep

    en

    de

    nt

    acti

    viti

    es]

    kernel number and iteration

    5,000 bodies

    50,000 bodies

    500,000 bodies

    5,000,000 bodies

    Available Amorphous Data Parallelism

    An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm 40

    bounding box tree building summarization sorting

    forc

    e c

    alc

    .

    inte

    gra

    tio

    n

  • Runtime Comparison

    GPU BH inefficiency:

    5k input too small for 5,760 to 23,040 threads

    BH vs. O(n^2) algorithm:

    Regular O(n^2) code is faster with fewer than about 15,000 bodies

    GPU vs. CPU (5M input):

    21.1x faster than OpenMP

    23.2x faster than Pthreads

    [chart: runtime per timestep [ms], log scale, vs. number of bodies (5k–5M) for GPU CUDA, GPU O(n^2), CPU OpenMP, and CPU Pthreads]

  • Kernel Performance for 5M Input

                  kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6  BarnesHut      O(n^2)
    Gflops/s          71.6       5.8       2.5       n/a     240.6      33.5      228.4       897.0
    GB/s             142.9      26.8      10.6      12.8       8.0     133.9        8.8         2.8
    runtime [ms]       0.4      44.6      28.0      14.2   1,641.2      2.2     1,730.6   557,421.5

    Runtime [ms], non-compliant fast single-precision version:

                  kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6
    X5690 CPU          5.5     185.7      75.8      52.1  38,540.3      16.4
    GTX 480 GPU        0.4      44.6      28.0      14.2   1,641.2       2.2
    CPU/GPU           13.1       4.2       2.7       3.7      23.5       7.3

    Runtime [ms], IEEE 754-compliant double-precision version:

                  kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6
    X5690 CPU         10.3     193.1     101.0      51.6  47,706.4      33.1
    GTX 480 GPU        0.8      46.7      31.0      14.2   7,714.6       4.2
    CPU/GPU           12.7       4.1       3.3       3.6       6.2       7.9

    $400 GPU delivers 228 Gflops/s on irregular code

    GPU chip is 2.7 to 23.5 times faster than CPU chip

    GPU hardware is better suited for running BH than CPU hardware is, but more difficult and time-consuming to program

  • Kernel Speedups

    Optimizations that are generally applicable:

                 avoid     rsqrtf   recalc.  thread   full multi-
                 volatile  instr.   data     voting   threading
    50,000        1.14x     1.43x    0.99x    2.04x    20.80x
    500,000       1.19x     1.47x    1.32x    2.49x    27.99x
    5,000,000     1.18x     1.46x    1.69x    2.47x    28.85x

    Optimizations for irregular kernels:

                 throttling  waitfree  combined   sorting of  sync'ed
                 barrier     pre-pass  mem fence  bodies      execution
    50,000        0.97x       1.02x     1.54x      3.60x       6.23x
    500,000       1.03x       1.21x     1.57x      6.28x       8.04x
    5,000,000     1.04x       1.31x     1.50x      8.21x       8.60x

  • Outline

    Introduction

    Barnes Hut algorithm

    CUDA implementation

    Experimental results

    Conclusions

  • Optimization Summary

    Minimize thread divergence:

    Group similar work together, force synchronicity

    Reduce main memory accesses:

    Share data within warp, combine memory fences & traversals, re-compute data, avoid volatile accesses

    Implement entire algorithm on and for GPU:

    Avoid data transfers & data structure inefficiencies, wait-free pre-pass, scan entire prefix union

  • Optimization Summary (cont.)

    Exploit hardware features:

    Fast synchronization and thread startup, special instructions, coalesced memory accesses, even lockstep execution

    Use light-weight locking and synchronization:

    Minimize locks, reuse fields, use fence + store ops

    Maximize parallelism:

    Parallelize every step within and across SMs

  • Conclusions

    Irregularity does not necessarily prevent high performance on GPUs

    Entire Barnes Hut algorithm implemented on GPU

    Builds and traverses an unbalanced octree

    GPU is 22.5 times (float) and 6.2 times (double) faster than a high-end hyperthreaded 6-core Xeon

    Code directly for the GPU, do not merely adjust CPU code

    Requires different data and code structures

    Benefits from different algorithmic modifications

  • Acknowledgments

    Collaborators:

    Ricardo Alves (Universidade do Minho, Portugal)

    Molly O'Neil (Texas State University-San Marcos)

    Keshav Pingali (University of Texas at Austin)

    Hardware and funding:

    NVIDIA Corporation