
Performance Evaluation of Particle Swarm Optimization Algorithms on GPU using CUDA

V. KRISHNA REDDY1 AND L. S. S. REDDY2

1 Associate Professor, Department of Computer Science and Engineering
2 Director, Lakkireddy Bali Reddy College of Engineering, Mylavaram, Andhra Pradesh

Abstract: Particle Swarm Optimization (PSO) is a simple but powerful optimization algorithm relying on the social behavior of the particles. PSO has become popular due to its simplicity and its effectiveness in a wide range of applications with low computational cost. The main objective of this paper is to implement parallel asynchronous and synchronous versions of PSO on the Graphical Processing Unit (GPU) and compare their performance, in terms of execution time and speedup, with their sequential versions. We also present the implementation details and performance observations of the parallel PSO algorithms on the GPU using Compute Unified Device Architecture (CUDA), a software platform from nVIDIA. We observed that the asynchronous version of the algorithm outperforms the other versions.

Keywords: Particle Swarm Optimization, Graphical Processing Unit, Compute Unified Device Architecture.

I. INTRODUCTION

Particle Swarm Optimization (PSO) is a simple but powerful optimization algorithm, introduced by Kennedy and Eberhart in 1995 [2]. PSO searches for the optimum of a function, termed the fitness function, following rules inspired by the behavior of flocks of birds searching for food. As a population-based meta-heuristic, PSO has recently gained more and more popularity owing to its robustness, effectiveness, and simplicity. Regardless of the choices of algorithm structure, parameters, etc., and despite its good convergence properties, PSO is still an iterative stochastic search process which, depending on problem hardness, may require a large number of particle updates and fitness evaluations.

Therefore, designing efficient PSO implementations is a problem of great practical relevance. It is even more essential if one considers real-time applications in dynamic environments in which, for example, the fast-convergence properties of PSO can be used to track moving points of interest (maxima or minima of a specific dynamically-changing fitness function). There are a number of computer vision applications in which PSO has been used to track moving objects, or to determine the location and orientation of objects or the posture of individuals. Some of these applications rely on the use of GPU multi-core architectures for general-purpose high-performance parallel computing, which have recently attracted increasing research interest, particularly since convenient programming environments, such as nVIDIA CUDA [4], have been introduced.

Such environments or APIs take advantage of the computing capabilities of GPUs using parallel versions of high-level languages, which require that only the highest-level details of parallel process management be explicitly encoded in the programs. The evolution of both GPUs and the corresponding programming environments has been very fast and, up to now, far from any standardization.

The paper is organized as follows: Section II provides a description of basic PSO, standard PSO, and the synchronous and asynchronous versions of the algorithm, along with their advantages and disadvantages. Design and implementation details are provided in Section III. Section IV summarizes and compares the results obtained on a classical benchmark, with the concluding remarks presented in Section V.

II. PSO OVERVIEW

This section presents four versions of the PSO algorithm with their advantages and disadvantages. First, the basics of the PSO algorithm are presented, followed by standard PSO and then the synchronous and asynchronous PSO algorithms.

2.1. PSO Basics

The core of PSO [2] is represented by the two functions which update a particle's position and velocity within the domain of the fitness function at time t+1, which can be computed using the following equations:

V_id(t+1) = w·V_id(t) + c1·r1·(XBest_id(t) − X_id(t)) + c2·r2·(XgBest_d(t) − X_id(t))   (1)

X_id(t+1) = X_id(t) + V_id(t+1)   (2)

where i = 1, 2, ..., N, N being the number of particles in the swarm (the population), and d = 1, 2, ..., D, D being the dimension of the solution space. In Equations (1) and (2), the learning factors c1 and c2 are non-negative constants, and r1 and r2 are random numbers uniformly distributed in the interval [0, 1]. V_id ∈ [−Vmax, Vmax], where Vmax is a chosen maximum velocity, a constant preset according to the objective function. If the velocity on one dimension exceeds the maximum, it is set to Vmax. This parameter controls the convergence rate of the PSO and can prevent the method from moving too fast. The parameter w is the inertia weight, used to balance the global and local search abilities; it is a constant in the interval [0, 1]. X_id(t+1) is the position of the particle at time t+1, XBest_id(t) is the best-fitness position reached by the particle up to time t (also termed the personal attractor), and XgBest is the best-fitness point ever found by the whole swarm (the social attractor). Despite its simplicity, PSO is known to be quite sensitive to the choice of its parameters. Under certain conditions, though, it can be proved that the swarm reaches a state of equilibrium, where particles converge onto a weighted average of their personal best and global best positions.
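As a minimal illustration of Equations (1) and (2), the following sequential sketch applies the velocity and position updates to every particle and dimension. The particle-major array layout and the rand01() helper are assumptions made for illustration, not the authors' implementation.

```cuda
// Sequential sketch of the basic PSO update of Eqs. (1)-(2).
// Array layout and the rand01() helper are illustrative assumptions.
#include <cstdlib>

static float rand01() { return (float)rand() / (float)RAND_MAX; }

void pso_update(float *x, float *v, const float *pbest, const float *gbest,
                int N, int D, float w, float c1, float c2, float vmax)
{
    for (int i = 0; i < N; ++i) {          // for each particle
        for (int d = 0; d < D; ++d) {      // for each dimension
            int id = i * D + d;
            float r1 = rand01(), r2 = rand01();
            // Eq. (1): inertia + cognitive (personal) + social terms
            v[id] = w * v[id]
                  + c1 * r1 * (pbest[id] - x[id])
                  + c2 * r2 * (gbest[d]  - x[id]);
            // clamp the velocity to [-Vmax, Vmax]
            if (v[id] >  vmax) v[id] =  vmax;
            if (v[id] < -vmax) v[id] = -vmax;
            // Eq. (2): move the particle
            x[id] += v[id];
        }
    }
}
```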

2.2. Standard Particle Swarm Optimization (SPSO)

In 2007, Daniel Bratton and James Kennedy defined a Standard Particle Swarm Optimization (SPSO), a simple extension of the original algorithm that takes into account more recent developments which can be expected to improve performance on standard measures.

SPSO differs from the original PSO mainly in the following aspects [7]:

(1) Swarm Communication Topology: The original PSO uses the global topology shown in Figure 2.1(a). In this topology, the global best particle, which is responsible for the velocity update of all the particles, is chosen from the whole swarm population. In SPSO, instead, there is no global best: each particle only uses a local best particle for its velocity update, chosen from among its left and right neighbours and itself. This is called a local or ring topology, as shown in Figure 2.1(b) (assuming that the swarm has a population of 12).

Figure 2.1: PSO and SPSO Topologies

(2) Inertia Weight and Constriction: In PSO, the inertia weight parameter w was designed to regulate the influence of the previous particle velocities on the optimization process. By adjusting the value of w, the swarm has a greater tendency to eventually constrict itself down to the area containing the best fitness and to explore that area in detail. Similar in purpose to the parameter w, SPSO introduces a new parameter χ, called the constriction factor, which is derived from the existing constants in the velocity update equation:


χ = 2 / |2 − φ − √(φ² − 4φ)|,   where φ = c1 + c2

and the velocity update formula in SPSO is:

V_id(t+1) = χ [V_id(t) + c1·r1·(XBest_id(t) − X_id(t)) + c2·r2·(XlBest_id(t) − X_id(t))]   (3)

where XlBest is no longer the global best but the local best.

Statistical tests have shown that, compared to PSO, SPSO returns better results while retaining the simplicity of PSO. The introduction of SPSO gives researchers a common grounding to work from, and SPSO can be used as a means of comparison for future developments and enhancements of PSO.
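The constriction factor can be computed directly from c1 and c2. The small helper below is an illustrative sketch, not taken from the paper's code; the comment shows the setting c1 = c2 = 2.05 recommended in [7], which yields χ ≈ 0.7298 and thus matches the w = 0.729 used in the experiments reported later.

```cuda
#include <cmath>

// Constriction coefficient chi derived from c1 and c2 (phi = c1 + c2 must exceed 4
// for the square root to be real). Illustrative helper, not the paper's code.
float constriction(float c1, float c2)
{
    float phi = c1 + c2;
    return 2.0f / std::fabs(2.0f - phi - std::sqrt(phi * phi - 4.0f * phi));
}
// Example: c1 = c2 = 2.05, as recommended in [7], gives phi = 4.1 and
// chi ~= 0.7298, which matches the w = 0.729 used in the experiments below.
```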

2.3. Synchronous PSO

A main feature that affects the search performance of PSO is the strategy according to which the social attractor is updated. In synchronous PSO [3], positions and velocities of all particles are updated one after another within a generation; this is actually a full algorithm iteration, which corresponds to one discrete time unit. Within the same generation, once velocity and position have been updated, each particle's fitness, corresponding to its new position, is evaluated. The value of the social attractor is only updated at the end of each generation, when the fitness values of all particles in the swarm are known. The sequence of steps for synchronous PSO is shown in Figure 2.2.

Figure 2.2: Sequential Synchronous PSO


2.4. Asynchronous PSO

The asynchronous version of PSO, instead, allows the social attractors to be updated immediately after evaluating each particle's fitness, which causes the swarm to move more promptly towards newly-found optima. In asynchronous PSO [1], the velocity and position update equations can be applied to any particle at any time, in no specific order. The sequence of steps for asynchronous PSO is shown in Figure 2.3.
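The structural difference between the two variants lies in where the social attractor is refreshed within an iteration. The host-side sketch below makes this explicit; the Swarm struct and the step helpers are placeholders chosen for illustration, not the authors' code.

```cuda
// Schematic contrast between a synchronous and an asynchronous PSO iteration.
// The Swarm struct and the step helpers are placeholders, not the authors' code.
struct Swarm { int N; /* positions, velocities, personal/social bests ... */ };

static void move_and_evaluate(Swarm &, int /*particle*/) { /* Eqs. (1)-(2) + fitness */ }
static void update_personal_best(Swarm &, int /*particle*/) { /* keep best-so-far   */ }
static void update_social_attractor(Swarm &)               { /* best of swarm/ring  */ }

void synchronous_iteration(Swarm &s)
{
    for (int i = 0; i < s.N; ++i) {
        move_and_evaluate(s, i);
        update_personal_best(s, i);
    }
    update_social_attractor(s);      // refreshed only once, after ALL particles
}

void asynchronous_iteration(Swarm &s)
{
    for (int i = 0; i < s.N; ++i) {
        move_and_evaluate(s, i);
        update_personal_best(s, i);
        update_social_attractor(s);  // refreshed immediately; later particles see it
    }
}
```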

III. SYSTEM DESIGN AND IMPLEMENTATION DETAILS

3.1. Parallel PSO for GPUs

Almost all recent GPU implementations assign one thread to each particle [3, 5, 6], which, in turn, means that fitness evaluations have to be computed sequentially in a loop within each particle's thread. Since fitness calculation is often the most computation-intensive part of the algorithm, the execution time of such implementations is affected by the complexity of the fitness function and the dimensionality of the search domain. These GPU implementations of PSO do not take full advantage of the GPU's power in evaluating the fitness function in parallel: the parallelization only occurs over the number of particles in a swarm and ignores the dimensions of the function.
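To make the limitation concrete, a coarse-grained kernel of this kind might look like the hypothetical sketch below (not taken from the cited implementations): each thread handles one particle and must loop over all D dimensions to evaluate the fitness, so the per-thread cost grows linearly with D.

```cuda
// Hypothetical coarse-grained kernel: one thread per particle, so the fitness
// (Sphere, as an example) is evaluated with a sequential loop over all D dimensions.
__global__ void fitness_one_thread_per_particle(const float *x, float *fit, int N, int D)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;   // particle index
    if (p < N) {
        float s = 0.0f;
        for (int d = 0; d < D; ++d) {                // sequential over dimensions
            float xi = x[p * D + d];
            s += xi * xi;
        }
        fit[p] = s;                                  // per-thread cost grows with D
    }
}
```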

In the parallel implementations, the thread parallelization is made as fine-grained as possible [1]; in other words, all independent sequential parts of the code are allowed to run simultaneously in separate threads. However, the performance of an implementation does not only depend on the design choices, but also on the GPU architecture, the data access scheme and layout, and the programming model, which in this case is CUDA. Therefore, it seems appropriate to outline the CUDA architecture and introduce some of its terminology.

Figure 2.3: Sequential Asynchronous PSO

3.2. CUDA Background

CUDA is a programming model and instruction set architecture that leverages the parallel computing capabilities of nVIDIA GPUs to solve complex problems more efficiently than a CPU. At the abstract level, the programming model requires the developer to divide the problem into coarse sub-problems, namely thread blocks, that can be solved independently in parallel, and each sub-problem into finer pieces that can be solved cooperatively by all the threads within the block [4]. From the software point of view, a kernel is equivalent to a high-level programming language function or method containing all the instructions to be executed by all threads of each thread block. Finally, at the hardware level, nVIDIA GPUs consist of a number of identical multithreaded Streaming Multiprocessors (SMs), each of which is made up of several cores that are able to run one thread block at a time. As the program invokes a kernel, a scheduler assigns thread blocks to SMs according to the number of available cores on each SM; the scheduler also ensures that delayed blocks are executed in an orderly fashion when more resources or cores are free. This makes a CUDA program automatically scalable on any number of SMs and cores.

The last thing to highlight is the memory hierarchy available to threads, and the performance associated with read/write operations from/to each of the memory levels. Each thread has its own local registers, and all threads belonging to the same thread block can cooperate through shared memory. Registers and shared memory are physically embedded inside the SMs and provide threads with the fastest possible memory access; their lifetime is the same as that of the thread block. All the threads of a kernel can also access global memory, whose content persists over all kernel launches [4]; however, read and write operations to global memory are orders of magnitude slower than those to shared memory and registers, so access to global memory should be minimized within a kernel.
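A minimal kernel illustrating these three memory levels is sketched below; it is an illustrative example, not one of the paper's kernels. Per-thread values live in registers, a block-wide reduction cooperates through shared memory, and only the final result is written back to global memory.

```cuda
// Illustrative kernel touching all three memory levels: registers (per-thread v),
// shared memory (block-wide reduction), global memory (input and final result).
__global__ void block_min(const float *g_fitness, float *g_block_min, int n)
{
    extern __shared__ float s_val[];                 // one shared slot per thread
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    float v = (gid < n) ? g_fitness[gid] : 1e30f;    // register, private to the thread
    s_val[tid] = v;
    __syncthreads();

    // tree reduction in shared memory (block size assumed to be a power of two)
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride && s_val[tid + stride] < s_val[tid])
            s_val[tid] = s_val[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        g_block_min[blockIdx.x] = s_val[0];          // global memory: persists after the kernel
}
// Possible launch with dynamic shared memory:
// block_min<<<blocks, threads, threads * sizeof(float)>>>(d_fitness, d_block_min, n);
```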

The design and implementation issues of our algorithms are presented in thefollowing sections.

3.3. Synchronous GPU-SPSO

The synchronous implementation [3] comprises three stages (kernels), namely: position update, fitness evaluation, and bests update. Each kernel is parallelized to run a thread for each problem dimension. The function under consideration is optimized by iterating the kernels needed to perform one PSO generation. The three kernels must be executed sequentially, and synchronization must occur at the end of each kernel run. Figure 3.1 clarifies this structure. Since the algorithm is divided into three independent sequential kernels, each kernel must load all the data it needs at the beginning and store the data back into global memory at the end of its execution; CUDA rules dictate that information sharing between different kernels is achievable only through global memory.

To better understand the difference between synchronous and asynchronous PSO, the pseudo-code of the sequential versions of the algorithms is presented in Figures 2.2 and 2.3. The synchronous 3-kernel implementation of GPU-SPSO, while allowing for virtually any swarm size, requires synchronization points where all the particles' data must be saved to global memory to be read by the next kernel. This frequent access to global memory limits the performance of synchronous GPU-SPSO and was the main justification behind the asynchronous implementation.
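A schematic host loop for this 3-kernel scheme might look as follows; the kernel names, signatures, and data layout are placeholders chosen for illustration, not the authors' actual code. Note that kernels launched on the same stream already execute in order, so the explicit synchronization shown simply marks the end of a generation.

```cuda
// Schematic host loop for the 3-kernel synchronous variant. Kernel bodies are stubs
// and the names, signatures, and data layout are illustrative, not the authors' code.
#include <cuda_runtime.h>

__global__ void kernel_update_positions(float *x, float *v, int D)        { /* Eqs. (1)-(2) */ }
__global__ void kernel_evaluate_fitness(const float *x, float *f, int D)  { /* fitness      */ }
__global__ void kernel_update_bests(const float *f, float *pb, float *lb) { /* bests        */ }

void run_sync_gpu_spso(float *d_x, float *d_v, float *d_f,
                       float *d_pbest, float *d_lbest,
                       int N, int D, int generations)
{
    for (int g = 0; g < generations; ++g) {
        // one block per particle, one thread per dimension; all data exchanged
        // between the stages must live in global memory, as CUDA requires
        kernel_update_positions<<<N, D>>>(d_x, d_v, D);
        kernel_evaluate_fitness<<<N, D>>>(d_x, d_f, D);
        kernel_update_bests<<<N, 1>>>(d_f, d_pbest, d_lbest);
        cudaDeviceSynchronize();   // same-stream kernels already run in order;
                                   // the explicit sync just marks the end of a generation
    }
}
```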

3.4. Asynchronous GPU-SPSO

The design of the parallelization process for the asynchronous version [1] is the same as for the synchronous one, that is: we allocate one thread block per particle, each of which executes one thread per problem dimension. This way, every particle evaluates its fitness function and updates position, velocity, and personal best for each dimension in parallel. The main effect of removing the synchronization constraint is to let each particle evolve independently of the others, which allows it to keep all its data in fast-access local and shared memory, effectively removing the need to store and maintain the global best in global memory. In practice, every particle checks its neighbours' personal best fitnesses, and then updates its own personal best in global memory only if it is better than the previously found personal best fitness. This can speed up execution time dramatically, particularly when the fitness function itself is highly parallelizable.

In contrast to the synchronous version, all particle thread blocks must be executing simultaneously, i.e., no sequential scheduling of thread blocks to processing cores can be employed, as there is no explicit point of synchronization of all particles. Two diagrams representing the parallel execution of both versions are shown in Figure 3.1. Having the swarm particles evolve independently not only makes the algorithm more biologically plausible, but it also makes the swarm more reactive to newly discovered minima/maxima [1]. The price to be paid is a limitation on the number of particles in a swarm, which cannot exceed the maximum number of thread blocks that a certain GPU can keep executing in parallel. This is not such a relevant shortcoming, as one of PSO's nicest features is its good search effectiveness; because of this, only a small number of particles (a few dozen) is usually enough for a swarm search to work, which compares very favorably to the number of individuals usually required by evolutionary algorithms to achieve good performance when high-dimensional problems are tackled. Also, parallel processing chips are currently scaling according to Moore's law, and GPUs are being equipped with more processing cores with the introduction of every new model.
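A greatly simplified single-kernel sketch of this scheme, specialized to the Sphere function, is shown below. The names, random-number handling, and in-kernel initialization (which a real implementation would perform in a separate step) are assumptions made for illustration, not the authors' code. Each resident block is one particle and each thread one dimension; the particle's state stays in shared memory, and global memory is touched only to publish personal bests and read the ring neighbours' bests, with cross-particle races tolerated by design.

```cuda
// Simplified single-kernel asynchronous variant on the Sphere function.
// Illustrative assumptions only: names, RNG, and in-kernel initialization.
#include <curand_kernel.h>

__global__ void async_spso_sphere(float *g_pbx,   // N*D personal-best positions
                                  float *g_pbf,   // N personal-best fitness values
                                  int N, int D, int generations,
                                  float w, float c1, float c2,
                                  unsigned long long seed)
{
    extern __shared__ float s[];                  // [ x | v | scratch ], 3*D floats
    float *x = s, *v = s + D, *scr = s + 2 * D;
    int p = blockIdx.x, d = threadIdx.x;

    curandState rng;
    curand_init(seed, (unsigned long long)(p * D + d), 0, &rng);
    x[d] = 50.0f + 50.0f * curand_uniform(&rng);  // init in (50, 100), as in Table 4.2
    v[d] = 0.0f;
    g_pbx[p * D + d] = x[d];
    if (d == 0) g_pbf[p] = 1e30f;
    __syncthreads();

    for (int g = 0; g < generations; ++g) {
        // ring-local best among {p-1, p, p+1}, read from global memory; races with
        // other particles are tolerated by design (no global synchronization)
        int left = (p + N - 1) % N, right = (p + 1) % N, best = p;
        if (g_pbf[left]  < g_pbf[best]) best = left;
        if (g_pbf[right] < g_pbf[best]) best = right;

        // velocity and position update for this dimension (cf. Eqs. (1)-(3))
        float r1 = curand_uniform(&rng), r2 = curand_uniform(&rng);
        v[d] = w * v[d]
             + c1 * r1 * (g_pbx[p * D + d]    - x[d])
             + c2 * r2 * (g_pbx[best * D + d] - x[d]);
        x[d] += v[d];

        // Sphere fitness via a block-wide reduction (D assumed to be a power of two)
        scr[d] = x[d] * x[d];
        __syncthreads();
        for (int stride = D / 2; stride > 0; stride >>= 1) {
            if (d < stride) scr[d] += scr[d + stride];
            __syncthreads();
        }

        // publish the personal best to global memory only when it improves
        bool improved = scr[0] < g_pbf[p];
        if (improved) g_pbx[p * D + d] = x[d];
        __syncthreads();
        if (d == 0 && improved) g_pbf[p] = scr[0];
        __syncthreads();
    }
}
// Possible launch: N resident blocks of D threads, 3*D floats of dynamic shared memory:
// async_spso_sphere<<<N, D, 3 * D * sizeof(float)>>>(d_pbx, d_pbf, N, D, 2000, 0.729f, 2.0f, 2.0f, 1234ULL);
```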

Figure 3.1: Asynchronous CUDA-PSO: Particles Run in Parallel Independently (left). Synchronous CUDA-PSO: Particles Synchronize at the End of Each Kernel (right)

4. RESULTS

In this section, we compare the performance of the different parallel PSO implementations and the sequential PSO implementations on a classical benchmark comprising a set of functions often used to evaluate stochastic optimization algorithms. The goal was to compare the different parallel PSO implementations with one another and with the sequential implementations, in terms of execution time and speedup, while checking that the quality of results was not badly affected. Therefore, all parameters of the algorithm were kept equal in all tests, setting them to the standard values suggested in [7]: w = 0.729 and c1 = c2 = 2.000. Also, for the comparison to be as fair as possible, the SPSO was adapted by substituting its original stochastic-star topology with the same ring topology adopted in the parallel GPU-based versions.

For the experiments, the parallel algorithms were developed using CUDA version 4.0. Tests were performed on a graphics card whose detailed specifications are given in Table 4.1. The sequential implementations were run on a PC powered by a 64-bit Intel(R) Core(TM) i3 CPU running at 2.27 GHz.


Table 4.1: Major Technical Features of the GPU Used for the Experiments

Model name: GeForce GTS 250
GPU clock (GHz): 1.62
Streaming Multiprocessors: 16
CUDA cores: 128
Bus width (bit): 256
Memory (MB): 1024
Memory clock (MHz): 1000.0
CUDA compute capability: 1.1

… we just pass two random integer numbers P1, P2 ∈ [0, M − D·N] from the CPU to the GPU; then 2·D·N numbers can be drawn from the array R, starting at R(P1) and R(P2) respectively, instead of transferring 2·D·N numbers from the CPU to the GPU.

Table 4.2: Benchmark Test Functions

Name             Equation                                                        Bounds           Initial Bounds    Optimum
f1 (Sphere)      f1(x) = Σ_{i=1..D} x_i²                                         (−100, 100)^D    (50, 100)^D       0.0
f2 (Rastrigin)   f2(x) = Σ_{i=1..D} [x_i² − 10·cos(2πx_i) + 10]                  (−5.12, 5.12)^D  (2.56, 5.12)^D    0.0
f3 (Rosenbrock)  f3(x) = Σ_{i=1..D−1} [100·(x_{i+1} − x_i²)² + (x_i − 1)²]       (−30, 30)^D      (15, 30)^D        0.0
f4 (Griewank)    f4(x) = (1/4000)·Σ_{i=1..D} x_i² − Π_{i=1..D} cos(x_i/√i) + 1   (−600, 600)^D    (−600, 600)^D     0.0
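For reference, the four benchmark functions of Table 4.2 can be written as simple device functions. The sketches below are illustrative (not the paper's code) and use CUDA's fast-math cosine intrinsic __cosf, in line with the fast math functions mentioned in the discussion of the results.

```cuda
// Illustrative device-side versions of the benchmark functions of Table 4.2,
// each evaluating one candidate solution x of dimension D.
__device__ float f_sphere(const float *x, int D)
{
    float s = 0.0f;
    for (int i = 0; i < D; ++i) s += x[i] * x[i];
    return s;
}

__device__ float f_rastrigin(const float *x, int D)
{
    float s = 0.0f;
    for (int i = 0; i < D; ++i)                       // __cosf: fast-math intrinsic
        s += x[i] * x[i] - 10.0f * __cosf(2.0f * 3.14159265f * x[i]) + 10.0f;
    return s;
}

__device__ float f_rosenbrock(const float *x, int D)
{
    float s = 0.0f;
    for (int i = 0; i < D - 1; ++i) {
        float t = x[i + 1] - x[i] * x[i];
        s += 100.0f * t * t + (x[i] - 1.0f) * (x[i] - 1.0f);
    }
    return s;
}

__device__ float f_griewank(const float *x, int D)
{
    float s = 0.0f, p = 1.0f;
    for (int i = 0; i < D; ++i) {
        s += x[i] * x[i];
        p *= __cosf(x[i] / sqrtf((float)(i + 1)));
    }
    return s / 4000.0f - p + 1.0f;
}
```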

The following implementations of PSO have been compared:

1. The sequential synchronous SPSO version, modified to implement a two nearest-neighbours ring topology (cpu-syn).
2. The sequential asynchronous PSO version, which uses a stochastic star topology (cpu-asyn).
3. The synchronous three-kernel version of GPU-SPSO (gpu-syn).
4. The asynchronous one-kernel version of GPU-SPSO (gpu-asyn).

Performance Metrics

• Computational cost (C), also known as execution time, is defined as the processing time (in seconds) that the PSO algorithm consumes.
• Computational throughput (V) is defined as the inverse of the computational cost: V = 1/C.
• Speedup (S) measures the achieved execution-time improvement. It is a ratio that evaluates how fast the variant of interest is in comparison with the variant of reference:

S = Vobj / Vref

where Vobj is the throughput of the parallel implementation under study, and Vref is the throughput of the reference implementation, i.e., the sequential implementation.
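For clarity, the three metrics can be computed from the measured wall-clock times as in the small host-side helper below; the struct and function names are illustrative, not from the paper.

```cuda
// Host-side helper computing the three metrics from measured wall-clock times.
struct Metrics { double C, V, S; };

Metrics compare_runs(double parallel_seconds, double sequential_seconds)
{
    Metrics m;
    m.C = parallel_seconds;                 // computational cost of the variant under study
    m.V = 1.0 / parallel_seconds;           // throughput V = 1 / C
    m.S = m.V * sequential_seconds;         // speedup S = Vobj / Vref = t_seq / t_par
    return m;
}
```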

The code was tested on the standard benchmark functions shown in Table 4.2, and the results are plotted in the graphs shown in Figures 4.1, 4.2, 4.3 and 4.4. For each function, the following comparisons have been made:

1. Keeping the problem dimension constant:
   (a) Execution time vs. swarm population size
   (b) Achieved speedup vs. swarm population size
2. Keeping the swarm population size constant:
   (a) Execution time vs. problem dimension
   (b) Achieved speedup vs. problem dimension

In general, the asynchronous version was much faster than the synchronous version. The asynchronous version allows the social attractors to be updated immediately after evaluating each particle's fitness, which causes the swarm to move more promptly towards newly-found optima. From Figures 4.1, 4.2, 4.3 and 4.4, it is clear that the GPU-asynchronous version takes less execution time than the others.

The sequential algorithm shows an unexpected behavior regarding execution time, which appears to be non-monotonic with problem dimension, with a surprising decrease as the problem dimension becomes larger. In fact, code optimization (or the hardware itself) might cause several multiplications to be directly set to zero without even performing them, as soon as the sum of the exponents of the two factors is below the precision threshold; a similar, though opposite, consideration can be made for additions and the sum of the exponents. One observation is that adding up terms all of comparable magnitude is much slower than adding the same number of terms on very different scales.

It is also worth noticing that the execution time graphs are virtually identicalfor the functions taken into consideration, which shows that GPUs are extremelyeffective at computing arithmetic-intensive functions, mostly independently of theset of operators used, and that memory allocation issues are prevalent indetermining performance.

The asynchronous version of the GPU-SPSO algorithm was able to significantly reduce execution time with respect not only to the sequential versions but also to the synchronous version of GPU-SPSO. Depending on the degree of parallelization allowed by the fitness functions we considered, the asynchronous version of GPU-SPSO could reach speed-ups ranging from 25 (Rosenbrock, Griewank) to over 75 (Rastrigin) with respect to the sequential implementation, and often of more than one order of magnitude with respect to the corresponding GPU-based 3-kernel synchronous version. From the results, one can also notice that the best performance was obtained on the Rastrigin function. This is presumably a result of the presence of advanced math functions in its definition. In fact, GPUs have internal fast math functions which can provide good computation speed at the cost of slightly lower accuracy, which causes no problems in this case.

Sphere Function

Figure 4.1: Execution Time and Speedup vs. Swarm Population Size and Problem Dimension for the Sphere Function. (a) Execution time vs. swarm population (w = 0.729, D = 50, Iter = 2000); (b) achieved speedup vs. swarm population (w = 0.729, D = 50, Iter = 2000); (c) execution time vs. problem dimension (w = 0.729, N = 100, Iter = 2000); (d) achieved speedup vs. problem dimension (w = 0.729, N = 100, Iter = 2000).

Rastrigin Function

Figure 4.2: Execution Time and Speedup vs. Swarm Population Size and Problem Dimension for the Rastrigin Function. (a) Execution time vs. swarm population (w = 0.729, D = 50, Iter = 5000); (b) achieved speedup vs. swarm population (w = 0.729, D = 50, Iter = 5000); (c) execution time vs. problem dimension (w = 0.729, N = 100, Iter = 5000); (d) achieved speedup vs. problem dimension (w = 0.729, N = 100, Iter = 5000).

Rosenbrock Function

Figure 4.3: Execution Time and Speedup vs. Swarm Population Size and Problem Dimension for the Rosenbrock Function. (a) Execution time vs. swarm population (w = 0.729, D = 50, Iter = 22000); (b) achieved speedup vs. swarm population (w = 0.729, D = 50, Iter = 22000); (c) execution time vs. problem dimension (w = 0.729, N = 100, Iter = 22000); (d) achieved speedup vs. problem dimension (w = 0.729, N = 100, Iter = 22000).

Griewank Function

Figure 4.4: Execution Time and Speedup vs. Swarm Population Size and Problem Dimension for the Griewank Function. (a) Execution time vs. swarm population (w = 0.729, D = 50, Iter = 2000); (b) achieved speedup vs. swarm population (w = 0.729, D = 50, Iter = 2000); (c) execution time vs. problem dimension (w = 0.729, N = 100, Iter = 2000); (d) achieved speedup vs. problem dimension (w = 0.729, N = 100, Iter = 2000).

5. CONCLUSION & FUTURE WORK

It was observed from the results that the asynchronous version of the GPU-SPSO algorithm was able to significantly reduce execution time with respect not only to the sequential versions but also to the synchronous version of GPU-SPSO. Depending on the degree of parallelization allowed by the fitness functions we considered, the asynchronous version of CUDA-PSO could reach speed-ups of up to about 75 (in the tests with the highest-dimensional Rastrigin functions) with respect to the sequential implementation, and often of more than one order of magnitude with respect to the corresponding GPU-based 3-kernel synchronous version, sometimes showing a limited, possibly only apparent, decrease of search performance. Hence, asynchronous GPU-SPSO is preferable for optimizing functions with more complex arithmetic, with the same swarm population, dimension, and number of iterations; furthermore, functions with more complex arithmetic achieve a higher speedup. For problems with a large swarm population or higher dimensions, the asynchronous GPU-PSO will provide better performance. Since most display cards in current common PCs have GPU chips, more researchers can make use of this parallel asynchronous GPU-SPSO to solve their practical problems.

Future work will include updating and extending this asynchronous GPU-SPSO to further applications of PSO and improving its performance. Other interesting developments may be offered by the availability of OpenCL, which will allow owners of GPUs other than nVIDIA's (as well as of multi-core CPUs, which are also supported) to implement parallel algorithms on their own computing architectures. The availability of shared code which allows for optimized code parallelization even on more traditional multi-core CPUs will make the comparison between GPU-based and multi-core CPU implementations easier (and, possibly, fairer), besides allowing for a possible optimized hybrid use of the computing resources in modern computers.

References

[1] Mussi, L.; Nashed, Y. S. G.; Cagnoni, S., "GPU-based Asynchronous Particle Swarm Optimization", Proceedings of GECCO'11, 2011, pp. 1555-1562.
[2] Kennedy, J.; Eberhart, R., "Particle Swarm Optimization", IEEE International Conference on Neural Networks, Perth, WA, Australia, Nov. 1995, pp. 1942-1948.
[3] Zhou, Y.; Tan, Y., "GPU-based Parallel Particle Swarm Optimization", IEEE Congress on Evolutionary Computation, 2009, pp. 1493-1500.
[4] nVIDIA Corporation, "nVIDIA CUDA Programming Guide 3.2", October 2010.
[5] Zhou, Y.; Tan, Y., "Particle Swarm Optimization with Triggered Mutation and its Implementation Based on GPU", Proceedings of the 12th Annual Genetic and Evolutionary Computation Conference, GECCO'10, 2010, pp. 1007-1014.
[6] Venter, G.; Sobieski, J., "A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations", 6th World Congress of Structural and Multidisciplinary Optimization, 2005.
[7] Bratton, D.; Kennedy, J., "Defining a Standard for Particle Swarm Optimization", IEEE Swarm Intelligence Symposium, April 2007, pp. 120-127.
[8] Wilkinson, B., "General Purpose Computing using GPUs: Developing a Hands-on Undergraduate Course on CUDA Programming", 42nd ACM Technical Symposium on Computer Science Education, SIGCSE 2010, Dallas, Texas, USA, March 9-12, 2010.