Implementation and Benchmarking of Two-Dimensional Vortex Interactions on a Graphics Processing Unit


Christopher C. Chabalko∗ and Balakumar Balachandran†

University of Maryland, College Park, Maryland 20742

DOI: 10.2514/1.I010141

A parallel single-instruction multiple data implementation of a two-level nested loop, which uses shared memory, is implemented via general-purpose computing on a graphics processing unit. The general-purpose computing on a graphics processing unit implementation is compared to MATLAB®, C, and other implementations of the same algorithm, which are primarily executed on the central processing unit. The general-purpose computing on a graphics processing unit implementation is determined to be decisively faster (80 times) than the fastest single threaded implementation. A linear algebra implementation is determined to consume excessive memory without a corresponding increase in computational performance. Although the speedup is hardware dependent, the general-purpose computing on a graphics processing unit algorithm exploits cache memory in a manner that is severely constrained on conventional multicore central processing units. For this reason, the nested loop described here is a natural fit for the single-instruction multiple data shared memory architecture. Details of the implementation are provided. The algorithm is applied to the simulation of vortex dynamics. In particular, it is applied to simulate the rollup of a vortex filament and carry out an unsteady simulation of a thin plate in ground effect. The cases presented here would be intractable to compute without the acceleration offered by the general-purpose computing on a graphics processing unit.

Nomenclature

N = number of vortices, domain size
n = unit normal
r = radial distance of a point
r = magnitude of distance |r|
r_cut = rigid-body cutoff distance
V = Cartesian velocity with components (V_x, V_y)
V_B = body velocity
V_θ = scalar velocity in the tangential direction
x_r = x Cartesian coordinate of receiving point
x_s = x Cartesian coordinate of sending point
y_r = y Cartesian coordinate of receiving point
y_s = y Cartesian coordinate of sending point
Γ = circulation strength
Δt = time step

I. Introduction

THE single-instruction multiple data (SIMD) computing architecture has been a popular choice for scientific computing applications since the first vector supercomputers in the 1970s [1]. The practice of general-purpose computing on a graphics processing unit (GPGPU) allows large-scale SIMD computations on current hardware [2]. A common type of SIMD implementation referred to as shared memory architecture allows efficient computation of certain algorithms where data must be shared by multiple processors. The particular cache structure associated with shared memory architecture can allow efficient data sharing, which is severely constrained on mainstream central processing units (CPUs). While most CPU architectures have moved from a dedicated cache to a shared cache, the shared cache may not be available to all cores, and the number of cores remains significantly smaller than the number available on GPGPU systems [3]. A two-level nested loop, where the contributions from the potential field associated with N bodies are computed on M other bodies, is a common algorithm in scientific computing. This has been applied to the simulation of gravitational bodies [4], molecular dynamics [5], crowd dynamics [6], and many other systems.

In the present work, vortex dynamics are simulated by representing vortices of varying strength as the N bodies. Then, the self-convection of a vortex filament can be modeled by N vortices acting on themselves. Such an N-body algorithm is also responsible for a significant portion of the expense in computing the wake convection routine of the unsteady vortex lattice method (UVLM) algorithm in two and three dimensions [7]. While the computational power of desktop computers has continued to increase, the need for discrete representations, including reduced-order models, remains an important part of the aerospace design and educational process. The two-dimensional UVLM formulation has been the cornerstone of preliminary airfoil design and has been applied in many significant areas, including flapping-wing flight [8,9] and biological flows [10]. An example configuration used in the two-dimensional unsteady vortex lattice method implementation is shown in Fig. 1. The body consists of bound vortices. The shed wake, denoted as "Free Lattice" in the figure, is composed of vortices, the number of which grows by one with each iteration.

Fig. 1 Illustration of a body and wake modeled by the vortex lattice method.

Presented as Paper 2013-0288 at the 51st AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition, Dallas, TX, 7–10 January 2013; received 11 June 2013; revision received 21 November 2013; accepted for publication 20 December 2013; published online 6 June 2014. Copyright © 2013 by Christopher Chabalko and Balakumar Balachandran. Published by the American Institute of Aeronautics and Astronautics, Inc., with permission. Copies of this paper may be made for personal or internal use, on condition that the copier pay the $10.00 per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923; include the code 2327-3097/14 and $10.00 in correspondence with the CCC.

*Research Assistant Professor, Department of Mechanical Engineering. Senior Member AIAA.
†Minta Martin Professor, Department Chair, Mechanical Engineering. Fellow AIAA.


The formulation can be extended to model closed bodies or thick cambered airfoils, multiple airfoils [11], and airfoils flying in ground effect [12]. In three dimensions, the method has been used to determine the preliminary design loads for a variety of wings, complete aircraft, and even wind turbines [13].

For an optimized N-body computation, one of the original applications is the modeling of the mutual gravitational attraction of N bodies in the field of astrophysics. A purpose-built processor referred to as the GRAPE (gravity pipe) was created specifically to address the computations involved in the N-body simulation [4]. In this case, the algorithm was hard-wired into the topology of the chips. The fundamental algorithm implemented in the hardware, namely, the tile-and-loop approach (described next), has since served as a prototype for many implementations. This approach was adapted to GPGPU by Hamada and Iitaka in the chamomile scheme [14]. The authors broke the domain into tractable tiles of 512 particles, stored the values in shared memory, computed the influence of each particle on a receiver, and repeated. In a more recent effort, Nyland, Harris, and Prins implemented a tile-and-loop approach to compute the gravitational attraction of particles in three dimensions using CUDA and more modern techniques in [15], Chapter 31. In each case, single precision values were used.

Two notable complements to the direct summation N-body algorithm are the fast multipole method and Ewald summation. These methods are approximations to the direct calculation and contain a user-adjustable parameter to vary their accuracy level. In addition, they require greater programming effort for implementation. Although their asymptotic complexity may be lower than that for direct summation, they each require significant initialization and are more appropriate for extremely large domain sizes. Their performance is highly implementation and problem dependent, and must be benchmarked against the standard direct summation. This work will focus on benchmarking the direct summation [16].

Many other efforts have accelerated fluid dynamics simulations with the graphics processing unit (GPU). Grid-based Eulerian efforts, such as the one described in [17], can directly take advantage of GPGPU linear algebra accelerations. Aside from directly applying linear algebra accelerations, solver-specific custom-written routines are typically required. With more complicated interactions, such as the domain decomposition associated with tree codes, it may be beneficial to split the work between the GPU and CPU to take advantage of the benefits of each platform. Stock and Gharakhani used this approach in [18], assigning most of the computations in their boundary element method to the GPU but reserving computations involving sparse domains for the CPU.

The present effort is presented in the context of the computational environment of mobile discrete interactions. This field was recently identified by the authors as a natural choice for the GPU. N-body systems with short-range and long-range interactions, including the discrete element method, smoothed particle hydrodynamics, and vortex simulations, all naturally fit into this framework [19]. Building upon the previous work by Chabalko et al. [20], a total of nine different implementations are examined, including two versions based on linear algebraic manipulations that, to the best of the authors' knowledge, have not been benchmarked in a similar context. Source code is provided for eight of the implementations studied. Furthermore, in this work, the authors use double precision for all physical values. The GPGPU implementation makes use of more modern hardware and software than that used in the previously mentioned references, and this implementation should be more relevant to the interested reader. It is shown that the tile-and-loop approach fits well into the current GPGPU architecture, and it offers significant acceleration over more traditional implementations.

This work is particularly relevant since desktop processing and supercomputing are increasingly turning toward multiple-core computing and, in particular, SIMD architecture via GPGPU. In this paradigm, hundreds or thousands of cores are used, at moderate clock speed, to perform concurrent computations. Algorithms must be recast to take advantage of the patterned memory access and cache structure available on the new hardware. Several attempts have been made to automatically convert existing algorithms; however, these methods continue to perform well only for certain cases [21]. Furthermore, it is beneficial for all computations to take place on the GPU in order to gain the full potential of acceleration, which may further complicate an automated method [22]. Modern compilers have built-in auto-optimizers that can automatically identify pieces of code that are then converted to optimized machine code [23]. This optimized code makes use of the cache hierarchy and advanced architecture set extensions to some degree. However, the algorithmic dependence of such conversion requires something akin to artificial intelligence to be implemented as a part of automatic conversion [24]. This implies that hand-coded algorithms will be required as a benchmark for the near future.

The N-body computation can be expressed as a set of nested loops or as a series of linear algebraic operations. The nested loop version was implemented in MATLAB, hand-coded single-threaded C, hand-coded multithreaded C via open multiprocessing (OpenMP), and hand coded as a GPGPU implementation referred to as an nVidia CUDA kernel. The linear algebraic operations were implemented in MATLAB and nVidia's CUDA through the MATLAB Parallel Computing Toolbox (PCT). All versions were benchmarked, and the hand-coded CUDA kernel was determined to be significantly faster than the other implementations. This implementation is demonstrated to maintain its speed advantage as the domain size is scaled. Furthermore, the GPGPU paradigm ensures that this implementation will scale with new hardware: that is, increased core count, shared memory size, and clock speed. Details of the implementation are provided in this work. While the application considered here is a specific variation on the Biot–Savart law to model vortex interactions, the synergistic SIMD implementation applies to any "all pairs" nested loop algorithm.

II. Two-Dimensional Vortex Interactions

The Biot–Savart law was originally developed to model the magnetic field generated by electric current [25]. The law has found many other applications in physical modeling. Among them is the description of the irrotational velocity field around a point vortex. A complete derivation, beginning with the Navier–Stokes equations, can be found in [7]. The velocity field induced by a vortex of strength Γ is defined as



V_\theta = \begin{cases} \dfrac{-\Gamma}{2\pi r}, & r \ge r_{\mathrm{cut}} \\[6pt] \dfrac{-\Gamma r}{2\pi r_{\mathrm{cut}}^2}, & r < r_{\mathrm{cut}} \end{cases} \qquad (1)

and acts in a purely tangential direction to the vortex, as shown in Fig. 2. For r ≥ r_cut, referred to as the far field, the velocity field can be modeled by the Biot–Savart law, as its strength is proportional to 1/r. For r < r_cut, referred to as the near field, the velocity field is consistent with rigid-body rotation, and its strength is proportional to r in order to model the vortex core. The algorithmic formulation of this relationship will be examined next. The influence of multiple vortices on a given point can be modeled by linear superposition of the velocity fields.

III. Algorithmic Implementations

In this N-body application, the influence of N vortices on many other points must be computed. This computation can be primarily carried out in two different ways. In each case, the number of senders is identified by n_S, and the number of receivers is identified by n_R. The Biot–Savart law is typically implemented in Cartesian coordinates. A computationally efficient relationship for expressing Eq. (1) in terms of (V_x, V_y), wherein the rigid-body vortex core is incorporated, is shown as follows:

\Delta x = x_r - x_s \qquad (2)

\Delta y = y_r - y_s \qquad (3)

|r|^2 = \Delta x^2 + \Delta y^2 \qquad (4)

Near field:

V_x = \frac{\Gamma \Delta y}{2\pi r_{\mathrm{cut}}^2} \qquad (5)

V_y = \frac{-\Gamma \Delta x}{2\pi r_{\mathrm{cut}}^2} \qquad (6)

Far field:

V_x = \frac{\Gamma \Delta y}{2\pi |r|^2} \qquad (7)

V_y = \frac{-\Gamma \Delta x}{2\pi |r|^2} \qquad (8)
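These relations are implemented in the MATLAB and C listings of Figs. 3 and A1–A9. As a compact illustration, the per-pair evaluation of Eqs. (2–8) can be sketched in C as follows; the function name, argument order, and the clamping of |r|² to r_cut² (in place of an explicit near-field/far-field branch) are illustrative choices rather than the paper's exact listing:

/* Velocity induced at receiver (xr, yr) by one vortex of strength gamma
   located at (xs, ys); Eqs. (2)-(8), with the rigid-body core for r < rcut. */
void biot_savart_pair(double xr, double yr, double xs, double ys,
                      double gamma, double rcut, double *vx, double *vy)
{
    const double PI = 3.14159265358979323846;
    double dx = xr - xs;                  /* Eq. (2) */
    double dy = yr - ys;                  /* Eq. (3) */
    double r2 = dx * dx + dy * dy;        /* Eq. (4) */
    if (r2 < rcut * rcut)                 /* near field: rigid-body rotation */
        r2 = rcut * rcut;
    *vx =  gamma * dy / (2.0 * PI * r2);  /* Eqs. (5) and (7) */
    *vy = -gamma * dx / (2.0 * PI * r2);  /* Eqs. (6) and (8) */
}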

A. Looping Implementation

The first implementation examined is a nested loop, where each i represents a sending point, typically referred to as the sender, and each j represents a receiving point, typically referred to as the receiver. The computation of the velocity field invoked by N senders on N receivers is of asymptotic computational complexity O(N²):

[Plot omitted: local velocity vectors and the locus of velocity at y = 0 over −0.5 ≤ x, y ≤ 0.5.]

Fig. 2 Velocity field induced by a single vortex of unit strength located at the origin. Note the region of rigid-body rotation at distances closer than r_cut to the vortex core.


\mathbf{V}_j = \sum_{i=1}^{n_S} \mathbf{V}_{\text{Biot-Savart}}(\mathbf{r}_j - \mathbf{r}_i, \Gamma_i), \qquad j = 1, 2, 3, \ldots, n_R \qquad (9)

The implementation of Eq. (9), where multiple senders and multiple receivers exist, involves repeating the preceding set of steps in a nested loop. A straightforward implementation of the nested loop is shown in the listing in Fig. 3. In this case, the senders are looped over in the outer loop, and the receivers are looped over in the inner loop. Inside the inner loop, the distance between the sender and the receiver is computed. This distance is then used in the computation of the influence of that particular sender on a given receiver. The influences on all the receivers, V_x and V_y, are accumulated (summed) in the inner loop. At the end of the inner loop, a new receiver is loaded and the process is repeated. Several variations on the looping code presented in Fig. 3 are examined in the benchmarking section. For a general algorithm, a symmetry condition cannot be implemented, as is occasionally seen in the literature, since the sending and receiving points may have different strengths.

Fig. 3 Portion of the MATLAB implementation of the Biot–Savart law for unique senders and receivers.
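The MATLAB listing of Fig. 3 is not reproduced in this transcript. A minimal C sketch of the structure it describes (senders in the outer loop, receivers in the inner loop, with the influence accumulated on each receiver), reusing the illustrative biot_savart_pair() helper from the earlier sketch, is:

/* Accumulate the influence of nS senders on nR receivers; O(nS*nR) work. */
void all_pairs(const double *xs, const double *ys, const double *gam, int nS,
               const double *xr, const double *yr,
               double *vx, double *vy, int nR, double rcut)
{
    for (int j = 0; j < nR; ++j) {          /* clear the accumulators        */
        vx[j] = 0.0;
        vy[j] = 0.0;
    }
    for (int i = 0; i < nS; ++i) {          /* outer loop: senders           */
        for (int j = 0; j < nR; ++j) {      /* inner loop: receivers         */
            double dvx, dvy;
            biot_savart_pair(xr[j], yr[j], xs[i], ys[i], gam[i], rcut,
                             &dvx, &dvy);
            vx[j] += dvx;                   /* accumulate, per Eq. (9)       */
            vy[j] += dvy;
        }
    }
}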

B. Linear Algebra Implementation

The same relationships can be expressed as a combination of linear algebraic operations. In this case, the x coordinate of the receivers is expressed as a column vector. Recalling that the number of senders is n_S, this is multiplied with a row vector of ones with n_S elements. The resulting matrix contains the x coordinates of the receivers replicated down the columns, and this matrix has the form

[x_r] = \begin{bmatrix} x_{r1} \\ x_{r2} \\ x_{r3} \\ \vdots \\ x_{r n_R} \end{bmatrix} \begin{bmatrix} 1 & \cdots & 1 \end{bmatrix} = \begin{bmatrix} x_{r1} & x_{r1} & \cdots & x_{r1} \\ x_{r2} & x_{r2} & \cdots & x_{r2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{r n_R} & x_{r n_R} & \cdots & x_{r n_R} \end{bmatrix} \qquad (10)

Through a similar procedure, a matrix of the x coordinates of the senders is created by multiplying a column vector of ones with n_R elements with a row vector containing the x coordinates of the senders as

[x_s] = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \begin{bmatrix} x_{s1} & x_{s2} & x_{s3} & \cdots & x_{s n_S} \end{bmatrix} = \begin{bmatrix} x_{s1} & x_{s2} & \cdots & x_{s n_S} \\ x_{s1} & x_{s2} & \cdots & x_{s n_S} \\ \vdots & \vdots & \ddots & \vdots \\ x_{s1} & x_{s2} & \cdots & x_{s n_S} \end{bmatrix} \qquad (11)

The resulting matrix contains a copy of the x coordinates of the senders along each row. An array of Δx values can be computed by subtracting [x_s] from [x_r] as

[\Delta x] = [x_r] - [x_s] = \begin{bmatrix} x_{r1} - x_{s1} & x_{r1} - x_{s2} & \cdots & x_{r1} - x_{s n_S} \\ x_{r2} - x_{s1} & x_{r2} - x_{s2} & \cdots & x_{r2} - x_{s n_S} \\ \vdots & \vdots & \ddots & \vdots \\ x_{r n_R} - x_{s1} & x_{r n_R} - x_{s2} & \cdots & x_{r n_R} - x_{s n_S} \end{bmatrix} \qquad (12)

A similar procedure is used to compute an array for [Δy]. The [Δx] and [Δy] arrays are treated elementwise to form the [Δr²] array as



[\Delta r^2] = [\Delta x^2] + [\Delta y^2] \qquad (13)

The individual values of [Δr²] are compared to the near-field cutoff condition and adjusted accordingly. The division of [Δy] (or [Δx]) by [Δr²] is carried out elementwise, maintaining the matrix dimension. Finally, the velocity influence is computed with a matrix–vector multiplication, where the array {Γ} contains the vortex strengths of the senders as

\{V_x\} = \frac{1}{2\pi} \left[ \frac{\Delta y}{\Delta r^2} \right] \{\Gamma\} \qquad (14)

and

\{V_y\} = \frac{-1}{2\pi} \left[ \frac{\Delta x}{\Delta r^2} \right] \{\Gamma\} \qquad (15)

While this method may seem preferable due to the explicit reliance on linear algebra, the memory requirements vastly exceed those of a nested loop implementation. The replicated arrays not only consume memory resources, but the additional memory accesses also increase computational time. The arrays required for the computation contain a great deal of replication, and this unnecessary replication is suggestive of why this may not be an optimal algorithm.

An estimate of the memory requirements for each method is shown in Table 1. For these estimates, double precision values of all quantities are considered. An equal number of senders and receivers is considered, and the various domain sizes considered are shown in the first column of this table. The looping method requires a two-coordinate array of senders, a two-coordinate array of receivers, an array of vortex strengths, and a two-coordinate array of the velocity influence. This results in 7N elements. The linear algebra method requires two square arrays for the two coordinates of the sending elements, two square arrays for the two coordinates of the receiving elements, an array of vortex strengths, a square array for the distance between senders and receivers, and a two-coordinate array of the velocity influence. This results in 5N² + 3N elements. The linear algebra-based implementation becomes quite expensive in terms of memory as the domain size is increased. These estimates serve as a starting point, as memory usage can be optimized through more sophisticated implementations.
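The entries in Table 1 follow directly from these element counts at 8 bytes per double precision value. A short C sketch that reproduces the estimates (illustrative only, with hypothetical variable names):

#include <stdio.h>

/* Print the Table 1 estimates: 7N doubles for the looping method and
   5N^2 + 3N doubles for the linear algebra method. */
int main(void)
{
    const double domain_sizes[] = {1000, 2000, 4000, 8000, 16000, 32000};
    for (int k = 0; k < 6; ++k) {
        double N = domain_sizes[k];
        double loop_bytes  = 7.0 * N * sizeof(double);
        double array_bytes = (5.0 * N * N + 3.0 * N) * sizeof(double);
        printf("N = %6.0f: loops %9.2f KiB, arrays %12.2f MiB\n",
               N, loop_bytes / 1024.0, array_bytes / (1024.0 * 1024.0));
    }
    return 0;
}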

IV. Computational Implementations

A total of nine variations on the computational implementations were examined. These variations are listed in Table 2 and described next. The variations are listed in order of execution speed, from the slowest to the fastest, in order to facilitate the analysis portion later in this effort.

In the first implementation, which is referred to as "MATLAB, inner loop senders," the senders are placed in the inner loop, and the receivers are placed in the outer loop, as shown in Fig. A1. The next implementation, which is referred to as "MATLAB, inner loop receivers," is the standard to which all other implementations have been compared. This implementation is useful for a beginner, and it is representative of what may be attempted as a first iteration in a research or prototyping environment. In this case, the receivers are updated in the inner loop. Since the outer loop contains the senders, G must only be updated when the outer loop is incremented, as shown in Figs. 3 and A2.

In the third implementation, which is labeled as "MATLAB linear algebra," the linear algebra scheme is implemented as computations, as described in Sec. III.B and shown in Fig. A3. The computational implementation follows the mathematical description of Sec. III.B quite closely, including efficient row and column replications. In the final step, the MATLAB matrix multiplication operator is applied to sum the influence of all the senders over all of the receivers.

Table 1  Memory requirements for the Biot–Savart law computation based on nested loops and linear algebra operations^a

No. vortices    Loops         Arrays
1000            54.69 KiB     38.17 MiB
2000            109.38 KiB    152.63 MiB
4000            218.75 KiB    610.44 MiB
8000            437.5 KiB     2.38 GiB
16,000          875 KiB       9.54 GiB
32,000          1.71 MiB      38.15 GiB

^a KiB denotes kibibytes, MiB denotes mebibytes, and GiB denotes gibibytes.

Table 2  List of implementations of the Biot–Savart law

No.   Implementation name
1     MATLAB, inner loop senders
2     MATLAB, inner loop receivers
3     MATLAB linear algebra
4     MATLAB PCT linear algebra
5     C, inner loop senders
6     C, inner loop receivers
7     OpenMP C, inner loop senders
8     OpenMP C, inner loop receivers
9     MATLAB PCT CUDA kernel


In the fourth implementation, which is referred to as "MATLAB PCT linear algebra," a GPGPU card is used to perform the linear algebraic operations. This implementation and code are similar to those of the third implementation, as shown in Fig. A4; however, all operations take place on the GPU. The arrays of senders, receivers, and all other required quantities have been created on the GPU by using the gpuArray() MATLAB command. In MATLAB, operator overloading is automatically applied to perform the operations on the GPU. This method suffers from the same shortcomings as the CPU implementation of the linear algebra scheme. Namely, the data replication is a major bottleneck, and the memory requirements only allow some of the smallest cases to be run. In other benchmarking applications, the authors observed a maximum of a 10x speed increase in the matrix vector product on the GPU via the MATLAB PCT as compared to the same operation on the CPU. Since the final operation in the algorithm is exactly a matrix vector product, the speedup of this implementation is significantly constrained. In reality, the speed is more significantly constrained by the data replication and other operations, making this implementation undesirable in terms of speed of execution and memory requirements.

The fifth implementation is the handwritten C code, referred to as "C, inner loop senders," with the senders in the inner loop, as shown in Fig. A5. This algorithm is almost identical to the first implementation, with header and footer codes, which allow the algorithm to be compiled with MATLAB's mex command. The sixth one, referred to as "C, inner loop receivers," is also handwritten C code, and this one is similar to the second one, with the receivers in the inner loop, as shown in Fig. A6.
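The mex header and footer codes are not reproduced here. A skeletal sketch of such a gateway, with an illustrative argument order and reusing the hypothetical all_pairs() routine from the earlier sketch, could look like the following:

#include "mex.h"

void all_pairs(const double *xs, const double *ys, const double *gam, int nS,
               const double *xr, const double *yr,
               double *vx, double *vy, int nR, double rcut);

/* Gateway: [Vx, Vy] = biotsavart_mex(xs, ys, gam, xr, yr, rcut)        */
/* (The input ordering and file name are illustrative assumptions.)     */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int nS = (int)mxGetNumberOfElements(prhs[0]);
    int nR = (int)mxGetNumberOfElements(prhs[3]);
    double rcut = mxGetScalar(prhs[5]);

    plhs[0] = mxCreateDoubleMatrix(nR, 1, mxREAL);   /* Vx output */
    plhs[1] = mxCreateDoubleMatrix(nR, 1, mxREAL);   /* Vy output */

    all_pairs(mxGetPr(prhs[0]), mxGetPr(prhs[1]), mxGetPr(prhs[2]), nS,
              mxGetPr(prhs[3]), mxGetPr(prhs[4]),
              mxGetPr(plhs[0]), mxGetPr(plhs[1]), nR, rcut);
}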

The seventh implementation, referred to as "OpenMP C, inner loop senders," is similar to the fifth one; however, multithreaded support was applied by using OpenMP directives, as shown in Fig. A7. OpenMP is described as a simple and effective way to apply parallelism to a given code. In the case of an OpenMP for loop, the loop is automatically broken up into K independent segments, where K is the number of available cores. Since the receivers remain in the outer loop, the outer loop was targeted for parallelization. Each thread is responsible for part of the outer loop, which requires the #pragma statement to be placed at the top of the outer loop. Like all parallel environments, improper application of OpenMP commands can result in race conditions and incorrect computations. In this case, placing the #pragma statement just above the inner loop can cause such a condition if no other precautions, such as specifying shared() and private() variables, are taken.

In the eighth implementation, referred to as "OpenMP C, inner loop receivers," the receivers are placed in the inner loop while applying OpenMP to use multiple threads, as shown in Fig. A8. This allows each thread (core) to modify an exclusive set of receivers, and it mitigates the possibility of multiple threads writing to the same location. While the placement of the OpenMP statement may imply that threads are created and destroyed after each iteration of the outer loop, in fact, the threads are spawned once and parked or repurposed each time the outer loop is cycled [26]. The benchmarking results support this as an efficient use of the OpenMP application programming interface (API) for this application. Placing the #pragma statement at the top of the outer loop would allow the influence of only 1/K of the senders to act on the receivers, since the other (K − 1)/K would be handled by different threads. Again, explicit use of shared() and private() variables would mitigate this inaccuracy, but they would also negatively influence computational performance.
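The listing of Fig. A8 is not reproduced here; a minimal sketch of the pattern just described, with the parallel for directive on the inner (receiver) loop so that each thread updates an exclusive chunk of receivers, is shown below (names reused from the earlier illustrative C sketch):

#include <omp.h>

/* OpenMP variant: senders in the outer loop, receivers in the inner loop,
   and the #pragma on the inner loop so that each thread writes a disjoint
   set of receivers (no race conditions, no atomics needed). */
void all_pairs_omp(const double *xs, const double *ys, const double *gam, int nS,
                   const double *xr, const double *yr,
                   double *vx, double *vy, int nR, double rcut)
{
    for (int j = 0; j < nR; ++j) {
        vx[j] = 0.0;
        vy[j] = 0.0;
    }
    for (int i = 0; i < nS; ++i) {
        #pragma omp parallel for
        for (int j = 0; j < nR; ++j) {
            double dvx, dvy;
            biot_savart_pair(xr[j], yr[j], xs[i], ys[i], gam[i], rcut,
                             &dvx, &dvy);
            vx[j] += dvx;
            vy[j] += dvy;
        }
    }
}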

Finally, the ninth implementation, referred to as "MATLAB PCT CUDA kernel," is a hand-coded CUDA kernel implementation of the tile-and-loop approach, which takes full advantage of shared memory and other modern features of the GPU. The CUDA code is compiled to parallel thread execution (.ptx) code with the Nvidia CUDA compiler and accessed in MATLAB by calling the feval(kernel, ...) command. Each thread is responsible for accumulating the influence of all senders on a single receiver. This configuration avoids race conditions and the need for atomic operations, since each thread is used only to perform a read–modify–write on its own data. A tile of threads loads a tile of senders into shared memory. Each thread then loops over the tile of senders and accumulates the influence of all senders on its single receiver. The procedure is repeated for another tile of senders until all senders and all receivers have been covered. This implementation is found to be decisively faster than the others.

V. Benchmarking Results

All nine variations of the multibody Biot–Savart law computation have been benchmarked. For simplicity, the number of senders is considered equal to the number of receivers, and this number is simply referred to as the domain size N. This is done to simplify the parameter space of the benchmarking procedure and provide insights into the mechanism of acceleration in the context of GPGPU. Furthermore, the domain size has been specifically chosen to avoid powers of 2 so as not to favor a particular cache size, which may result in a spurious timing result. In reality, the Biot–Savart algorithm is intended to be used on domain sizes ranging from 1 to 30,000 (and higher) senders and receivers. Application to differing numbers of senders and receivers is a topic for future work. Benchmarking cases were carried out on a Dell Precision T7500 running Windows 7, with an Intel Xeon E5607 2.26 GHz quad-core processor, a Tesla C2050 card, and sufficient RAM. Timing information and MATLAB implementations were obtained in MATLAB 2011b with the Parallel Computing Toolbox v5.2 by using the tic and toc commands. The C code was compiled with default MATLAB executable (mex) settings and the Visual Studio 2010 compiler. Additional compile and linking flags were set for OpenMP support. The CUDA kernel was compiled with CUDA SDK v4.2 in Visual Studio 2010, targeting Compute Capability 2.0. In all cases, double precision floating point numbers were used to calculate physical quantities, and in CUDA, the use-fast-math option was deliberately eliminated in order to maintain the maximum possible accuracy. For the OpenMP cases where multiple processors were used, the aforementioned exact configuration with two E5607 processors was available.

The results of the benchmarking applications are summarized in Table 3 and plotted in Fig. 4. The timing information is an average of five executions of each domain size for each algorithm. The domain size, N, was increased logarithmically. In Fig. 4, the domain is shown on the abscissa in a logarithmic (base 2) scaling. The time of execution in seconds is shown on the ordinate with a logarithmic (base 10) scaling.

Table 3  Elapsed time (in seconds) for nine different implementations of the Biot–Savart law for a range of domain sizes N

Procedure                         N = 1000   N = 2000   N = 4000   N = 8000   N = 16,000   N = 32,000
MATLAB, inner loop senders        0.0983     0.3770     1.5097     6.0488     24.2860      97.0567
MATLAB, inner loop receivers      0.0882     0.3497     1.4005     5.6060     22.3805      89.8237
MATLAB linear algebra             0.0557     0.2174     0.8837     3.4890     N/A          N/A
MATLAB PCT linear algebra         0.3887     0.4014     0.4540     N/A        N/A          N/A
C, inner loop senders             0.0209     0.0820     0.3264     1.3066     5.2207       20.8859
C, inner loop receivers           0.0107     0.0429     0.1733     0.6882     2.7240       10.9294
OpenMP C, inner loop senders      0.0104     0.0221     0.0875     0.3489     1.4082       5.6452
OpenMP C, inner loop receivers    0.0068     0.0138     0.0464     0.1804     0.7073       2.8028
MATLAB PCT CUDA kernel            0.0013     0.0020     0.0040     0.0108     0.0402       0.1399


The slowest implementation was version 1: MATLAB inner loop senders. This version was consistently 15% slower than version 2: MATLAB inner loop receivers. The MATLAB linear algebra implementation (version 3) was consistently 1.6 times faster than the fastest MATLAB looping implementation (version 2: MATLAB inner loop receivers). However, for large domain sizes (in this case, 16,000 points and greater), the MATLAB linear algebra implementation was unable to execute due to excessive memory requirements. The MATLAB PCT linear algebra version (version 4) used a similar formulation to the previous version, except it ran on the GPU. This version returned inconsistent performance, and the authors were only able to run a limited number of cases since memory is severely constrained on the GPU.

Both C implementations of nested loops were found to be faster than their MATLAB counterparts. Version 5 (C, inner loop senders) is about four times faster than version 1. Placing the receivers on the inner loop increased the speed of the C implementation by almost a factor of 2, making version 6 (C, inner loop receivers) more than eight times faster than version 2. Next, the multithreaded C code implemented with OpenMP is faster than the fastest single-threaded C code. Version 7, OpenMP C inner loop senders, is nominally 16 times faster than version 2. Placing the receivers and the OpenMP statement in the inner loop increased the speed further. Version 8, OpenMP C inner loop receivers, was found to be nominally 31 times faster than version 2. Finally, version 9, MATLAB PCT CUDA kernel, ranged from 70 to 640 times faster than version 2, with the largest speedup occurring for the largest domain size. This method was consistently faster than any other method and maintained its speed increase with larger domain sizes. It is also almost 80 times faster than the native C code implementation. The memory transfer to and from the GPU is considered part of the computation in this case, and this is included as a part of the cited benchmarking metric.

The versions based on OpenMP showed excellent efficiency in scaling with the number of cores. Considering that four cores were used in benchmarking, version 7 is 20.8 s / 5.6 s = 3.71 times faster than the single-threaded C version. This implies an efficiency of 3.71 times / four cores = 92.8% in utilization of the available cores. Version 8 showed a 3.9 times speedup over its single-threaded C counterpart. Overall, this method showed a 96% efficiency in multicore scaling. However, this scaling is not followed with a second processor. Although not shown on the plots, the benchmarking application was run on an identical machine with an additional processor. The overall speedup of the OpenMP versions increased, however, with diminishing returns. The efficiency of the related scaling on the largest domain was found to be 43% (70%) for version 8 (version 9).

The O(N²) scaling is readily observed in the data. An O(N²) line is shown in gray dots, aligned with the data for version 1, in Fig. 4. All seven cases run on the CPU show scaling parallel to this trend. This fact remains through all of the different implementations and numbers of cores. The final version, MATLAB PCT CUDA kernel, which was run on the GPU, begins to show O(N²) scaling on large domains; however, the domain sizes are too limited to conclusively identify a scaling trend.

An additional comparison of the fastest methods, computed as a speedup with respect to version 2 (MATLAB inner loop receivers), is shown in Fig. 5 and presented in Table 4. Benchmarking data of the fastest implementation of each group, namely, versions 2, 3, 6, 8, and 9, are compared. The consistent acceleration of each method can be identified. The MATLAB linear algebra implementation is consistently 1.6 times faster than the looping implementation; however, due to the large memory requirements, this implementation is unable to run many of the large domain sizes. Furthermore, this implementation departs significantly from the straightforward looping implementation. Efforts in reducing execution time may be better directed elsewhere. The C inner loop receivers version is consistently 8.2 times faster than the MATLAB implementation. The OpenMP inner loop receivers version is consistently 31 times faster than the MATLAB implementation for large domain sizes. Finally, the MATLAB PCT CUDA kernel is up to 640 times faster than the MATLAB implementation.

[Plot omitted: execution time in seconds versus domain size N, on logarithmic axes, for the nine implementations listed in Table 2.]

Fig. 4 Benchmarking data for nine different implementations of the Biot–Savart law for N senders and N receivers (lower is faster).

[Plot omitted: speedup versus domain size N for the MATLAB PCT kernel, OpenMP C inner loop receivers, C inner loop receivers, MATLAB matrix, and MATLAB loops implementations.]

Fig. 5 Plot of speedup with respect to version 2 (MATLAB, inner loop receivers) for the fastest version of each implementation (higher is faster).

Table 4  Summary of speed with respect to MATLAB inner loop receivers for the fastest of each type

Procedure                        N = 1000   N = 2000   N = 4000   N = 8000   N = 16,000   N = 32,000
MATLAB, inner loop receivers     1.000      1.000      1.000      1.000      1.000        1.000
MATLAB linear algebra            1.583      1.609      1.585      1.607      N/A          N/A
C, inner loop receivers          8.229      8.154      8.080      8.145      8.216        8.219
OpenMP C, inner loop receivers   12.996     25.274     30.214     31.077     31.640       32.047
MATLAB PCT CUDA kernel           67.536     173.898    351.934    516.933    556.365      641.987


The GPU kernel is equivalent to a nested loop, including the frugal memory usage, although implemented in the SIMD architecture. Memory access is kept to a minimum, and memory fetches are stored in high-speed shared memory for efficient access. This algorithm takes a tile-and-loop approach, where tiles of the domain are loaded and looped over. Then, the next tile is loaded and looped over until the domain has been exhausted. The hand-coded CUDA kernel increases in computational efficiency as the domain size increases, and the efficiency finally plateaus with around 32,000 points in the domain.

Many benchmarks cite floating point operations per second (FLOP/S) as a measure of processor utilization. Making use of Visual Studio 2010 running Parallel Nsight 2.2, the authors recorded an "achieved flops" count of 164 GigaFLOP/S (GFLOP/S). This algorithm is both floating point and memory intensive, and it is not expected to achieve the maximum possible FLOP/S count of the card. The reader is referred to [27,28] for a further discussion of why the FLOP/S count is not necessarily the best measure of computational performance. For a limiting example, consider the matrix transpose routine included with the CUDA SDK 4.2, which returned a FLOP/S count of 0 when benchmarked in a similar manner. Details of the SIMD algorithm are provided in the next section.

VI. Implementation of the Single-Instruction Multiple Data Algorithm

The implementation of the algorithm follows a tile-and-loop approach. A schematic of the algorithm is shown in Fig. 6. The size of the shared memory is chosen to be the same as the number of threads per block, which is referred to as a tile size. Each thread is assigned a single receiver to accumulate the velocity influence of all sending vortices. Since each receiver is only updated by a single thread, no write conflicts are present and no atomic operations are required when modifying the cumulative velocity. In stage 1, each active thread is assigned a receiver position. In stage 2, each thread loads the location and strength of a sender into a shared memory location. In this manner, the entire block of shared memory is loaded simultaneously by a block of threads. The number of senders that can be accommodated in the shared memory is the size of a tile. The entire domain of senders is handled one tile at a time.

[Schematic omitted: three representative threads, each holding its receiver position rPos = (x[i], y[i]), load senders 0–2 into shared memory in stage 2 and then accumulate the influence of the shared-memory senders on receivers 0–2 in stages 3–5.]

Fig. 6 Illustration of the SIMD implementation of the nested loop version of the Biot–Savart law.


In stages 3–5, each receiver reads through consecutive shared memory locations. Each location represents a sender, and the influence of each sender is accumulated on each receiving location. The shared memory read can be executed by each thread simultaneously and is referred to as a "broadcast" operation. No memory bank conflicts are present. After the entire shared memory array has been looped through by all threads in the block, stages 2–5 are repeated for a new tile of senders. The stages are repeated until all tiles of senders have been handled. Through this, the influence of every sender is accumulated on every receiver.
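The CUDA source is given in Fig. A9, which is not reproduced in this transcript. A minimal sketch of a tile-and-loop kernel organized around the stages above is shown next; the names, the compile-time tile size, and the zero-strength padding of the final partial tile are illustrative assumptions rather than the authors' exact code:

#define TILE 128   /* threads per block; also the shared-memory tile size */
#define PI_D 3.14159265358979323846

__global__ void biot_savart_tiles(const double *xs, const double *ys,
                                  const double *gam, int nS,
                                  const double *xr, const double *yr,
                                  double *vx, double *vy, int nR, double rcut)
{
    __shared__ double sx[TILE], sy[TILE], sg[TILE];
    int j = blockIdx.x * blockDim.x + threadIdx.x;   /* stage 1: my receiver  */
    double px = (j < nR) ? xr[j] : 0.0;
    double py = (j < nR) ? yr[j] : 0.0;
    double ax = 0.0, ay = 0.0;                       /* running accumulators  */

    for (int tile = 0; tile < nS; tile += TILE) {
        int i = tile + threadIdx.x;                  /* stage 2: load a tile  */
        sx[threadIdx.x] = (i < nS) ? xs[i]  : 0.0;
        sy[threadIdx.x] = (i < nS) ? ys[i]  : 0.0;
        sg[threadIdx.x] = (i < nS) ? gam[i] : 0.0;   /* zero strength = no-op */
        __syncthreads();

        for (int k = 0; k < TILE; ++k) {             /* stages 3-5            */
            double dx = px - sx[k];
            double dy = py - sy[k];
            double r2 = dx * dx + dy * dy;
            if (r2 < rcut * rcut) r2 = rcut * rcut;  /* rigid-body core       */
            ax += sg[k] * dy / (2.0 * PI_D * r2);
            ay -= sg[k] * dx / (2.0 * PI_D * r2);
        }
        __syncthreads();                             /* before the next tile  */
    }
    if (j < nR) {                                    /* write-back, race free */
        vx[j] = ax;
        vy[j] = ay;
    }
}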

In reality, the number of shared memory locations is much larger than those shown in the illustration. The number of threads per block is an available parameter that can be adjusted for tuning the algorithm. Typically, 32 threads per block are suggested as a starting point, and other values should be tested as well. A lower number is typically desirable because it allows many blocks to be created, and the thread scheduler can schedule these blocks to hide latency. This particular algorithm benefits from using a large amount of shared memory rather than hiding latency. This suggests a larger number of threads per block to achieve a larger tile size and larger shared memory arrays.

A table of relative speeds as compared to a single thread per block for domain sizes of 4000 and 8000 elements is shown in Table 5. The suggested choice of 32 threads per block is not ideal, but it is not the worst choice either. The performance is similar for values greater than or equal to 128 threads per block. A value of 128 threads per block was used in this work to provide a good balance between performance and portability. Other GPGPU cards in the Computational Dynamics Laboratory at the University of Maryland are unable to support shared memory sizes larger than 128 elements for this particular algorithm. The larger domain size seems to benefit the most from a larger number of threads per block.
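In the paper, the compiled kernel is invoked from MATLAB through the PCT feval command. For readers working directly in CUDA C, an equivalent host-side launch (a sketch under that assumption, not the authors' driver code, and assuming the biot_savart_tiles kernel and TILE constant from the previous sketch) would size the grid from the receiver count and the chosen tile size:

#include <cuda_runtime.h>

/* Host-side sketch: one thread per receiver, TILE threads per block.
   The d_* arguments are device pointers assumed to be allocated and
   populated already (e.g., with cudaMalloc and cudaMemcpy). */
void launch_biot_savart(const double *d_xs, const double *d_ys,
                        const double *d_gam, int nS,
                        const double *d_xr, const double *d_yr,
                        double *d_vx, double *d_vy, int nR, double rcut)
{
    int threadsPerBlock = TILE;                                /* 128 in this work  */
    int blocks = (nR + threadsPerBlock - 1) / threadsPerBlock; /* ceil(nR / TILE)   */
    biot_savart_tiles<<<blocks, threadsPerBlock>>>(d_xs, d_ys, d_gam, nS,
                                                   d_xr, d_yr, d_vx, d_vy,
                                                   nR, rcut);
    cudaDeviceSynchronize();                                   /* wait for the GPU  */
}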

VII. Application

A. Ground Effect of Cnoidal Waves

The implementation of the CUDA kernel is significantly faster than any other available method of computing the influence of vortices on other points. This was used in the following UVLM ground-effect application. Validation of the single-body UVLM method, including experimental validation of vortex dynamics, and numerical work based on the ground-effect simulation can be found in previous work [9,20]. In the present work, the ground is defined as a cnoidal wave described by

y(x) = h\,[\mathrm{cn}(u)]^2, \qquad u(x) = \frac{2K(m)}{\lambda}\,x \qquad (16)

where y(x) is the wave height as a function of the spatial coordinate x, h is the height parameter of 1/2 chord, cn is the Jacobi elliptic function with argument u, K is the complete elliptic integral of the first kind, λ is the wavelength chosen to be four chords, and m is a shape parameter chosen to be 0.9. The body is set at a constant angle of attack of 5 deg, translating to the left at a nondimensional velocity of 1 chord per unit time. The resulting model is shown in Fig. 7. The bound and free vortices are colored according to strength, with the leading vortex on the airfoil exceeding the highest value on the scale. A detailed view of the wake self-convection over the same ground plane after the body has traveled 14 chords upstream is shown in the lower portion of the figure.
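Equation (16) can be evaluated with any library that provides the Jacobi elliptic functions; the paper does not say how the authors evaluated it. A sketch using the GNU Scientific Library, under the assumption of its conventions that gsl_sf_elljac_e takes the parameter m while gsl_sf_ellint_Kcomp takes the modulus k = sqrt(m), is:

#include <math.h>
#include <gsl/gsl_mode.h>
#include <gsl/gsl_sf_ellint.h>
#include <gsl/gsl_sf_elljac.h>

/* Cnoidal ground profile of Eq. (16): y(x) = h*cn(u)^2, u = 2*K(m)*x/lambda.
   The ground-effect case uses m = 0.9 and a wavelength of four chords. */
double cnoidal_height(double x, double h, double m, double lambda)
{
    double K = gsl_sf_ellint_Kcomp(sqrt(m), GSL_PREC_DOUBLE); /* K(m), modulus sqrt(m) */
    double u = 2.0 * K * x / lambda;
    double sn, cn, dn;
    gsl_sf_elljac_e(u, m, &sn, &cn, &dn);                     /* Jacobi cn(u|m)        */
    return h * cn * cn;
}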

The CUDA kernel accelerated a significant portion of the computations in this case. The assembly of the influence matrix and the solution of the resulting linear system take up a majority of the remaining computational load. These algorithms will be addressed in future work.

Fig. 7 Body moving over cnoidal waves (top), and wake rollup (bottom).

Table 5  Benchmarking the number of threads per block (tile size) in the SIMD implementation of the Biot–Savart law for two domain sizes

Threads per block   N = 4000   N = 8000
1                   1.00       1.00
16                  17.26      24.56
32                  24.04      39.60
48                  30.13      47.38
64                  33.71      51.02
80                  33.26      53.45
96                  33.37      57.96
112                 32.36      54.89
128                 33.15      61.70
144                 24.33      59.53
160                 28.88      58.96
176                 32.72      52.12
192                 33.10      66.33
224                 30.92      57.55


B. Vortex Filament Self-Rollup

An additional application of vortex filament self-rollup is considered. A collection of 50,000 point vortices, all of equal strength and governed by the Biot–Savart law, is initially configured in a line with a sinusoidal perturbation of amplitude 0.002, as shown in the top portion of Fig. 8. The collection of all vortices has unit vorticity. The vortices are allowed to self-convect, and after 0.12 s, this collection assumes the configuration shown in the middle portion of Fig. 8. A later configuration (t = 0.6 s) is shown in the lower portion of the figure, and a detailed view of two macrovortices is shown in the bottom portion of the figure. In each case, the filament is rotated about its center point in order to be depicted in a horizontal manner. The rigid-body cutoff is apparent in the center of the detailed view of the vortices. The rigid-body region can be identified by a straight line cutting through the center of the macrovortex cores. The behavior of the filament edge is also apparent, as the motions of the edges of the vortex filament are not countered by adjacent vorticity. The edges tend to wrap around the rest of the filament.

Fig. 8 Sinusoidal vortex filament rollup; three phases (top–middle); detailed view (bottom).

The simulation took place entirely on the GPU, using the hand-coded CUDA kernel and native MATLAB GPU commands for convection. This negates the need to transfer memory back to the host every time step, and it results in an execution that is faster than the benchmarking case. The simulation required approximately 2 min of computational time when using the GPU. It is estimated that it would have taken over 24 h in a native MATLAB implementation. Furthermore, the multiple runs and the user interaction required to adjust parameters and arrive at a suitably demonstrative configuration would have been intractable in native MATLAB.
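The time integrator used for the convection is not described in the paper. Purely as an illustration of a convection step that keeps all data resident on the device, a forward-Euler update kernel (an assumption, not the authors' scheme) might look like:

/* Convect n vortices with their induced velocities over one time step dt.
   Forward Euler is shown only for illustration; the paper does not state
   which integrator was used. */
__global__ void convect(double *x, double *y,
                        const double *vx, const double *vy,
                        int n, double dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] += vx[i] * dt;   /* new position = old position + V * dt */
        y[i] += vy[i] * dt;
    }
}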

An additional self-convection case of 50,000 equal-strength vortices, beginning with a cnoidal perturbation, is shown in Fig. 9. Again, the total field has unit vorticity. In this case, a cnoidal wave of h = 0.002, m = 0.9, and λ = 1/9 served as the initial perturbation. A snapshot obtained after 0.12 s is shown in the middle portion of Fig. 9; this figure reveals a different character to the perturbation than that seen in the previous sinusoidal case. This is due to the initial cnoidal wave having a periodic spikelike behavior (m = 0.9) rather than that of a smooth and equally distributed sinusoid. The individual spikes develop into large perturbations, and the flat portion between spikes develops into a smaller perturbation. The large and small perturbations finally coalesce into a macrovortex. The final configuration, shown in the lower portion of Fig. 9, is obtained at t = 0.362, almost half the amount of time of the sinusoidal case. A detailed view of the macrovortices is shown in the bottom portion of the figure. Again, the edge effects are apparent, and the edges tend to roll up around the rest of the filament.

Fig. 9 Cnoidal vortex filament rollup; three phases (top–middle); detailed view (bottom).

VIII. Conclusions

The SIMD architecture available on modern GPGPU cards is especially appropriate for the acceleration of the all pairs N-body problem. A typical graphics card can run such a simulation up to 640 times faster than a standard CPU-based MATLAB simulation, up to 80 times faster than a single-threaded C implementation, and more than 20 times faster than an OpenMP C implementation on a quad-core processor. The mechanism responsible for the large speedup is the efficient use of shared memory by the GPU hardware. This trait is severely constrained on CPU hardware. Furthermore, the GPGPU paradigm scales with domain size and hardware improvements, and it is not expensive in memory usage.



Preliminary results indicate that a more current graphics card (Tesla K20c) returns twice the performance of the card used in this benchmarking application (approximately 1300 times faster than version 2). An alternative linear algebra formulation is prohibitive in memory requirements and does not offer improvements in computational time.

The significant accelerations allow user interactions and simulation domain sizes that were previously computationally intractable. For illustration, a flat plate, and its wake, moving in cnoidal ground effect have been examined. The wake rollup was found to be influenced by the ground topology. Furthermore, the life cycles of two separate vortex filaments, one sinusoidal and one cnoidal in initial shape, composed of 50,000 vortices each, were simulated in a self-rollup configuration. The cnoidal vortex filament attained a rolled-up state almost twice as fast as the initially sinusoidal filament. This efficient computation may usher in a new era of computational modeling efforts.

Appendix: Implementation Source Code

Source code for the implementations described in the paper can be found in figures (Figs. A1–A9):

Fig. A1 MATLAB inner loop senders.

Fig. A2 MATLAB inner loop receivers.

Fig. A3 MATLAB linear algebra.


Fig. A4 MATLAB PCT linear algebra.

Fig. A5 C inner loop senders.

Fig. A6 C inner loop receivers.

Fig. A7 OpenMP C inner loop senders.

Fig. A8 OpenMP C inner loop receivers.

Fig. A9 MATLAB PCT CUDA kernel.


Acknowledgment

Partial support received for this work through National Science Foundation grants CMMI-1125285 and CMMI-1250187 is gratefully acknowledged.

References

[1] Espasa, R., Valero, M., and Smith, J. E., "Vector Architectures: Past, Present and Future," Proceedings of the 12th International Conference on Supercomputing, ACM, New York, 1998, pp. 425–432.
[2] Luebke, D., Harris, M., Govindaraju, N., Lefohn, A., Houston, M., Owens, J., Segal, M., Papakipos, M., and Buck, I., "GPGPU: General-Purpose Computation on Graphics Hardware," Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, ACM, New York, 2006.
[3] Chai, L., Gao, Q., and Panda, D. K., "Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System," 7th IEEE International Symposium on Cluster Computing and the Grid, IEEE, Piscataway, NJ, 2007, pp. 471–478.



[4] Ito, T., Makino, J., Ebisuzaki, T., and Sugimoto, D., "A Special-Purpose N-Body Machine GRAPE-1," Computer Physics Communications, Vol. 60, No. 2, 1990, pp. 187–194. doi:10.1016/0010-4655(90)90003-J
[5] Verlet, L., "Computer 'Experiments' on Classical Fluids. I. Thermodynamical Properties of Lennard-Jones Molecules," Physical Review, Vol. 159, No. 1, 1967, pp. 98–103. doi:10.1103/PhysRev.159.98
[6] Langston, P. A., Masling, R., and Asmar, B. N., "Crowd Dynamics Discrete Element Multi-Circle Model," Safety Science, Vol. 44, No. 5, 2006, pp. 395–417. doi:10.1016/j.ssci.2005.11.007
[7] Katz, J., and Plotkin, A., Low-Speed Aerodynamics, Vol. 13, Cambridge Univ. Press, New York, 2001.
[8] Fitzgerald, T., Valdez, M., Vanella, M., Balaras, E., and Balachandran, B., "Flexible Flapping Systems: Computational Investigations into Fluid–Structure Interactions," Aeronautical Journal, Vol. 115, No. 1172, 2011, pp. 593–604.
[9] Chabalko, C., Fitzgerald, T., Valdez, M., and Balachandran, B., "Flapping Aerodynamics and Ground Effect," 50th AIAA Aerospace Sciences Meeting, AIAA Paper 2012-0420, 2012.
[10] Long, L. N., and Fritz, T. E., "Object-Oriented Unsteady Vortex Lattice Method for Flapping Flight," Journal of Aircraft, Vol. 41, No. 6, 2004, pp. 1275–1290. doi:10.2514/1.7357
[11] Konstadinopoulos, P., Thrasher, D. F., Mook, D. T., Nayfeh, A. H., and Watson, L., "A Vortex-Lattice Method for General, Unsteady Aerodynamics," Journal of Aircraft, Vol. 22, No. 1, 1985, pp. 43–49. doi:10.2514/3.45078
[12] Boschetti, P. J., Cárdenas, E. M., Amerio, A., and Arevalo, A., "Stability and Performance of a Light Unmanned Airplane in Ground Effect," 48th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition, AIAA Paper 2010-0293, 2010.
[13] Whale, J., Anderson, C. G., Bareiss, R., and Wagner, S., "An Experimental and Numerical Study of the Vortex Structure in the Wake of a Wind Turbine," Journal of Wind Engineering and Industrial Aerodynamics, Vol. 84, No. 1, Jan. 2000, pp. 1–21. doi:10.1016/S0167-6105(98)00201-3
[14] Hamada, T., and Iitaka, T., "The Chamomile Scheme: An Optimized Algorithm for N-Body Simulations on Programmable Graphics Processing Units," ArXiv Astrophysics e-prints, astro-ph/0703100, March 2007.
[15] Nguyen, H., GPU Gems 3, Addison–Wesley Professional, Reading, MA, 2007, pp. 667–694.
[16] Frenkel, D., and Smit, B., Understanding Molecular Simulation, Second Edition: From Algorithms to Applications, Academic Press, New York, 2001, pp. 291–316.
[17] Phillips, E. H., Zhang, Y., Davis, R. L., and Owens, J. D., "Acceleration of 2-D Compressible Flow Solvers with Graphics Processing Unit Clusters," Journal of Aerospace Computing, Information, and Communication, Vol. 8, No. 8, 2011, pp. 237–249. doi:10.2514/1.44909
[18] Stock, M. J., and Gharakhani, A., "Graphics Processing Unit-Accelerated Boundary Element Method and Vortex Particle Method," Journal of Aerospace Computing, Information, and Communication, Vol. 8, No. 7, 2011, pp. 224–236. doi:10.2514/1.52938
[19] Chabalko, C., and Balachandran, B., "GPU Based Simulations of Physical Systems Characterized by Mobile Discrete Interactions," Developments in Parallel, Distributed, Grid and Cloud Computing for Engineering, Saxe-Coburg Publ., Stirlingshire, Scotland, U.K., 2013, pp. 95–124.
[20] Chabalko, C. C., Fitzgerald, T., and Balachandran, B., "GPGPU Implementation and Benchmarking of the Unsteady Vortex Lattice Method," 51st AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition, AIAA Paper 2013-0288, 2013.
[21] Arenaz, M., Touriño, J., and Doallo, R., "XARK: An EXtensible Framework for Automatic Recognition of Computational Kernels," ACM Transactions on Programming Languages and Systems (TOPLAS), Vol. 30, No. 6, 2008. doi:10.1145/1391956.1391959
[22] Luebke, D., "CUDA: Scalable Parallel Programming for High-Performance Scientific Computing," 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro, IEEE, Piscataway, NJ, 2008, pp. 836–838.
[23] Allen, R., and Kennedy, K., Optimizing Compilers for Modern Architectures, Morgan Kaufmann, San Mateo, CA, 2002.
[24] Cooper, D., Subramanian, D., and Torczon, L., "Adaptive Optimizing Compilers for the 21st Century," Journal of Supercomputing, Vol. 23, No. 1, 2001, pp. 7–22. doi:10.1023/A:1015729001611
[25] Pappas, P. T., "The Original Ampere Force and Biot–Savart and Lorentz Forces," Il Nuovo Cimento B Series 11, Vol. 76, No. 2, 1983, pp. 189–197. doi:10.1007/BF02721552
[26] "The OpenMP API Specification for Parallel Programming," OpenMP Architectural Review Board, openmp.org, 2009.
[27] Volkov, V., and Demmel, J. W., "Benchmarking GPUs to Tune Dense Linear Algebra," Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE, Piscataway, NJ, 2008, pp. 31:1–31:11.
[28] Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and Hwu, W. W., "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA," Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, New York, 2008, pp. 73–82.

E. Atkins
Associate Editor
