

The 14th IFToMM World Congress, Taipei, Taiwan, October 25-30, 2015 DOI Number: 10.6567/IFToMM.14TH.WC.OS2.030

The Evolution of the Computing Time in the simulation of Mimbot-Biped Robot using Parallel Algorithms

A. Bustos H. Rubio J. C. García Prada J. Meneses University Carlos III of Madrid

Madrid, Spain

Abstract: This work presents a biped walking robot with a reduced number of degrees of freedom. Based on classical mechanisms, the MIMBOT robot is able to emulate advanced human skills. Its topology is described and mathematical models are proposed to solve the kinematics. The explicit equations resulting from this model are implemented on several computing platforms in two ways: the classical one, using the CPU to execute the routines, and another one taking advantage of the parallel capabilities of GPUs. Several simulations are carried out on all platforms with the aim of analyzing the behaviour of the hardware and software used. Finally, results are presented showing that parallelizing the simulation is not always the best option.

Keywords: GPGPU, parallel computing, biped, walking robot, simulation

I. Introduction

Legged locomotion, especially biped walking, has aroused great interest from universities and companies in recent decades. Many robots have been built using several kinds of leg [1]. Most of these biped robots copy the structure of the human leg. However, this approach is complex and has several disadvantages, such as high weight, numerous driven joints and sophisticated electronic control [2].

Another way to address this problem is through classical mechanisms with 1 degree of freedom (DOF). This yields simpler robots with few motors that also simulate the human gait with lower energy consumption. In particular, Prof. Ceccarelli and his team at the Laboratory of Robotics and Mechatronics (LARM) have been working continuously in this field. Proof of this are the biped robots EP-WAR and CALUMA and other legged walking robots using classical mechanisms [3], [4], [5].

The PASIBOT biped walking robot, Fig. 1, arises from the cooperation between the LARM and MAQLAB research groups, and several studies about it have been published [6], [7], [8]. This robot follows the principles of simple design: it is a 1 DOF mechanical system based on the intelligent combination of classical mechanisms. Despite this simplicity, it reproduces human walking. The evolution of the mechanical system gives the robot new skills thanks to the addition of two linear actuators on each hip [8]. This evolution also implies a change of name: from now on it is called MIMBOT. Kinematic models have been developed for this biped robot, requiring numerous simulations with large computational costs.


Fig. 1. Prototype of PASIBOT robot.

Graphics Processing Units (GPUs) have a collection of attributes (hundreds or thousands of cores, fast access to onboard memory, a high level of task parallelization, etc.) [9] that make them well suited to computing iterative and/or large problems. In 2006, NVIDIA introduced CUDA (Compute Unified Device Architecture) on its GPUs. This new architecture, along with the SDKs (Software Development Kits) and APIs (Application Programming Interfaces) developed for it, has taken computation to a new level, creating GPGPU (General-Purpose Computing on Graphics Processing Units) and saving huge amounts of time when computing large and complex problems. Another remarkable aspect is their cost: the most advanced GPU costs about $4000, while a reasonable one is around $500, which is much cheaper than a traditional supercomputer and requires less energy to work [10].

Many researchers around the world have realized the power and capabilities of GPGPU [9]. As a result, a lot of work is being done to parallelize existing problems or to tackle tasks that were unaffordable until now. There are numerous examples of GPGPU applied to all kinds of scientific disciplines. In the field of mechanics, the references are Prof. Negrut, Prof. Tasora et al., who are currently working on the parallelization of multi-physics dynamic systems involving millions of frictional contacts and bilateral mechanical constraints [11], [12], [13], [14].

This work presents a reduced DOF leg mechanism with a linkage architecture. A kinematic model is formulated and implemented on the MATLAB and C++ platforms. The computing capabilities of the GPU are explored in order to accelerate the execution of the algorithms and analyze their behaviour. Finally, the results obtained are analyzed and conclusions are drawn.

II. The biped robot MIMBOT

The biped robot MIMBOT is a reduced DOF mechanical system based on the clever combination of classical mechanisms to emulate human walking. A Tchebyshev's mechanism generates an almost straight-line trajectory and gives the robot its basic movement. Next, a pantograph mechanism reverses and amplifies it, so that the final gait is obtained. The pantograph's central point (point M) is fixed to the hip. This mechanism is driven by the short side (point B) and the final gait trajectory is obtained on the long side (point A). Finally, a stabilization system is added to give structural consistency and to ensure that the foot stays parallel to the ground without additional actuation.

On MIMBOT, two small linear actuators (one horizontal and one vertical) are added on each side of the robot's hip, giving it some mimetic skills. The actuator's rod is attached to the fixed point of the pantograph mechanism (point M). In this way, moving point M modifies the pantograph's geometry and, consequently, the gait (trajectory of point A) changes to achieve the desired mimetic skills.

The biped's evolution is divided into five steps, as shown in Fig. 2. The first step corresponds to a simple scheme of the leg, made up of the two main mechanisms (Tchebyshev and pantograph mechanisms). The second one adds the stabilization mechanism and the third one adds the linear actuators located in the hip. Step four is based on the second one, but the stabilization system is evolved to guarantee that the foot is always parallel to the ground. Finally, in step five, the linear actuators are added to the model with the new stabilization system.

Fig. 2. Five steps in evolution of MIMBOT

A. Topological characteristics

As explained above, the MIMBOT robot consists of three classical mechanisms arranged to obtain a gait similar to that of human beings.

These three mechanisms can be decomposed into four kinematic chains for further analysis. The first chain comprises the Tchebyshev's mechanism, highlighted in Fig. 3a. The second one is the pantograph mechanism, highlighted in Fig. 3b. The last two chains involve the stabilization mechanism and some links from the pantograph mechanism. The third chain consists of the upper section of the stabilization mechanism and the upper link of the pantograph, elements highlighted in Fig. 3c. The last chain comprises the lower part of the stabilization mechanism and the longest link of the pantograph mechanism. This chain is highlighted in Fig. 3d.

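Fig. 3. Kinematic chains used to solve the kinematics: (a) Tchebyshev's mechanism, (b) pantograph mechanism, (c) upper section of the stabilization mechanism, (d) lower section of the stabilization mechanism.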

Link dimensions of MIMBOT are specified in Table I and Fig. 4. These dimensions are chosen so that the step of the biped robot has a length similar to that of an adult human being. As a consequence, the height of the robot is similar to the height of a human leg. The Tchebyshev's mechanism chosen generates a straight line of approximately 130 mm. Given the relationship between the links forming the pantograph mechanism, this straight trajectory is amplified to approximately 280 mm. In addition, the pantograph reverses the trajectory generated by the Tchebyshev's mechanism, leaving the flight phase of the movement above the ground. Lastly, the stabilization mechanism replicates the longest links of the pantograph at a given distance to avoid possible collisions between the mechanisms.

Fig. 4. Main kinematic chains of MIMBOT robot.

Link    Dimension (mm)    Relationship
1       60                2r
2       30                r
3       150               5r
4       75                2.5r
5       180               6r
6       270               9r
7       180               6r
8       540               18r
9       270               9r
10      540               18r
11      90                3r
12      90                3r

Table I. Link dimensions for MIMBOT robot.

The original stabilization system involves a link joining the upper points of the pantograph and stabilization mechanisms, as well as the output from the Tchebyshev's mechanism. One of its ends moves along a slider placed in the hip, as shown in Fig. 5a. Although this layout provides structural consistency and support, it allows the foot to tilt during the flight phase of the gait. To correct this behaviour, a new stabilization system is designed. This new mechanical system consists of two sliders perpendicular to each other, as shown in Fig. 5b. The vertical slider moves along the hip with the "fixed element" of the horizontal one attached to it. This improvement does not restrict the output of the Tchebyshev's mechanism, while it keeps the foot always parallel to the ground.

Fig. 5. Comparison between the two stabilization systems. The original one is at left and the new one is at right.

B. Linear actuators

The addition of a pair of linear actuators in each leg allows the execution of advanced mimetic skills. These new capabilities include complex tasks such as lengthening the step, making turns, going up and down stairs or avoiding obstacles of a certain height. In order to carry out these new functions, 6 predefined movements are implemented on the linear actuators, allowing a total of 8 possibilities of performance. The rods of the linear actuators are attached to point M in the hip, so any movement of the actuators modifies the position of point M.

Fixed stroke or null (both actuators): point M is set to a certain position, which may or may not be the original one. In either case, point M does not move during the biped's walk. It can be used to shift the centre of mass inside the stability rectangle defined by the feet resting on the ground.

Front modification of step (horizontal actuator): this movement modifies the path of the foot to adjust the length of the step at the front of it, according to a predetermined path feature designed for this purpose. The value set determines the actuator stroke and the gait will be lengthened (positive values) or shortened (negative values) in consequence. If the value entered is zero, the horizontal actuator is inactive.

From previous studies [6], it is established that the foot rests on the ground for 60% of the cycle time, while for the other 40% it is in the air. The trajectory of the linear actuator is designed taking this into account. In this way, the main movement takes place during the flight phase, and the rod returns to its original position during the support phase. The mathematical expressions defining the movement of the actuators are those shown in equations (1)-(4).

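$$x_M(t) = x_{M0} + k\,v_1 \sin\!\left(\frac{t}{k}\right) - v_1 t \tag{1}$$

$$\dot{x}_M(t) = v_1 \cos\!\left(\frac{t}{k}\right) - v_1 \tag{2}$$

$$\ddot{x}_M(t) = -\frac{v_1}{k} \sin\!\left(\frac{t}{k}\right) \tag{3}$$

$$v_1 = \frac{c}{k \sin\!\left(\frac{T}{k}\right) - T} \tag{4}$$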

Where c is the stroke length, T is the cycle time, t is the time elapsed since the start of this movement and k is a parameter that varies according to the movement chosen and the flight or support phase of the gait.

Rear modification of step (horizontal actuator): This is the same feature as the previous one, but in this case the variation is at the back of the step.

Ascension of the foot (vertical actuator): This feature modifies the maximum height of the foot and acts only during the flight phase. The only required input parameter is the actuator stroke. As before, a positive value raises the foot and a negative one lowers it. The equations defining the movement of this skill are similar to equations (1)-(4).

Inclination of the foot (vertical actuator): This skill also modifies the height of the foot, but this time the change takes place at the end of the flight phase. The foot is elevated during the whole flight phase and returns to the original cycle during the support phase. As usual, if the value set is positive, the foot rises; if it is negative, the foot falls.

Trapezoidal (both actuators): The last predefined movement is a combination of two uniformly accelerated linear motions (at the beginning and end of the actuator stroke) and a middle section with uniform linear motion. Therefore, it requires setting the initial position, velocity and acceleration, as well as the acceleration for the last section and the times at which the change from one motion to another takes place.

III. Kinematic Model

Once the closure equations corresponding to each kinematic chain are posed and solved, it is possible to find the explicit expressions that define the kinematics of every link. These equations are stated as functions of the input crank angle (θ2). Taking this into account, the angular position of any link can be written as shown in equation (5):

$$\theta_i = \theta_i(\theta_2), \quad i = 1, 2, \ldots, 1', 2', \ldots \tag{5}$$

Then, the X-Y coordinates of its centre of mass can easily be expressed with respect to that angle, as equation (6) shows:

$$X_i^{CDM} = X_i^{CDM}(\theta_2); \quad Y_i^{CDM} = Y_i^{CDM}(\theta_2); \quad i = 1, 2, \ldots \tag{6}$$

Furthermore, if the time-dependent function for the input crank is known, these equations can be expressed as time-dependent functions, and then velocities and accelerations are obtained by taking their first and second derivatives.

Four kinematic chains make up the biped kinematics, corresponding to the Tchebyshev and pantograph mechanisms and the upper and lower sections of the stabilization system. Using the Raven method, it is possible to write the following four equation systems, equations (8)-(11), which solve the kinematic problem, where ri is the length of link "i".

The chain corresponding to the Tchebyshev (or upper) mechanism is formed by links 1 (ground), 2, 3 and 4, so the closure equation for this case is (see Fig. 4):

$$r_1 e^{j\theta_1} + r_2 e^{j\theta_2} + r_3 e^{j\theta_3} + r_4 e^{j\theta_4} = 0 \tag{8}$$

The pantograph (or lower) mechanism is formed by links 5, 6, 7 and 8. Given the topological characteristics of the pantograph, the closure equation can be reduced to:

$$\overrightarrow{MB} - r_6 e^{j\theta_6} - r_8 e^{j\theta_8} = 0 \tag{9}$$

where $\overrightarrow{MB}$ is the vector joining points "M" and "B". The position of point "B" is the trajectory generated by the Tchebyshev's mechanism, whereas the position of point "M" is fixed in the hip or given by the movement of the linear actuators.

Finally, the stabilization system is formed by links 9, 10, 11, 12 and the stabilization bar. Here, there are two closed loops: the upper one, made up of links 6 (length from “B” to “C”), 9, 11 and the stabilization bar; and the lower, with links 8 (length from “C” to “A”), 10, 11 and 12. So the two closure equations are the following ones:

$$r_C e^{j\theta_7} + r_{11} e^{j\theta_{11}} = r_f e^{j\delta} + r_9 e^{j\theta_9} \tag{10}$$

$$r_A e^{j\theta_8} + r_{12} e^{j\theta_{12}} = r_{11} e^{j\theta_{11}} + r_{10} e^{j\theta_{10}} \tag{11}$$

where δ is the angle between the horizontal axis and the stabilization bar, measured counter-clockwise.

Solving the equation system corresponding to the Tchebyshev's mechanism, we can express the angles θ4 and θ3 as functions of θ2, see equations (12) and (13), where a_aux, b_aux and c_aux are auxiliary variables dependent on θ2.

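$$\theta_4 = \cos^{-1}\!\left(\frac{2\,a_{aux} c_{aux} \pm 2\,b_{aux}\sqrt{b_{aux}^2 - c_{aux}^2 + a_{aux}^2}}{2\left(a_{aux}^2 + b_{aux}^2\right)}\right) \tag{12}$$

$$\theta_3 = \cos^{-1}\!\left(\frac{2\,a_{aux} - r_4 \cos\theta_4}{r_3}\right) \tag{13}$$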
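Position problems of this kind usually reduce to a scalar equation of the form a cos θ + b sin θ = c, whose two roots correspond to the two assembly configurations of the loop. The sketch below assumes that the auxiliary variables play the roles of a, b and c; the function name `solveCosSin` is hypothetical, and it uses an atan2-based formulation rather than the arccosine expressions of the paper.

```cpp
#include <algorithm>
#include <cmath>

// Solves a*cos(theta) + b*sin(theta) = c for theta (radians).
// Returns false when no real solution exists (the loop cannot be closed);
// otherwise writes both branches of the solution.
bool solveCosSin(double a, double b, double c,
                 double& thetaPlus, double& thetaMinus) {
    const double r = std::hypot(a, b);       // amplitude of the left-hand side
    if (r == 0.0 || std::fabs(c) > r) {
        return false;
    }
    // a*cos(theta) + b*sin(theta) = r*cos(theta - phi)
    const double phi = std::atan2(b, a);
    // Clamp to guard against rounding slightly outside [-1, 1].
    const double ratio = std::max(-1.0, std::min(1.0, c / r));
    const double alpha = std::acos(ratio);
    thetaPlus = phi + alpha;
    thetaMinus = phi - alpha;
    return true;
}
```

Which of the two branches is physically meaningful depends on how the mechanism is assembled, so a caller would normally keep the root that is continuous with the previous time step.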

The same process is followed for the equation system corresponding to the pantograph mechanism, obtaining angles θ8 and θ6 from equations (14) and (15), respectively. As before, f_aux is an auxiliary variable dependent on θ2, defined in order to compact the equation.

$$\theta_8 = \cos^{-1}\!\left(\frac{f_{aux}\,x_{MB} \pm y_{MB}\sqrt{4 r_8^2\left(x_{MB}^2 + y_{MB}^2\right) - f_{aux}^2}}{2 r_8\left(x_{MB}^2 + y_{MB}^2\right)}\right) \tag{14}$$

$$\theta_6 = \cos^{-1}\!\left(\frac{x_{MB} - r_8 \cos\theta_8}{r_6}\right) \tag{15}$$

The angle θ7 is obtained from a simple trigonometric relation between it and θ6, as equation (16) shows.

$$\theta_7 = \theta_6 = \cos^{-1}\!\left(\frac{x_{MB} - r_8 \cos\theta_8}{r_6}\right) \tag{16}$$

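Finally, solving the two equation systems corresponding to the upper and lower sections of the stabilization mechanism, we obtain the expressions stated in equations (17)-(20), which determine first the angles θ9 and θ11, and then θ10 and θ12.

$$\theta_9 = \cos^{-1}\!\left(\frac{a_g\,c_{aux} \pm b_g\sqrt{4 r_9^2\left(a_g^2 + b_g^2\right) - c_{aux}^2}}{2 r_9\left(a_g^2 + b_g^2\right)}\right) \tag{17}$$

$$\theta_{11} = \cos^{-1}\!\left(\frac{r_f \cos\delta + r_9 \cos\theta_9 - r_C \cos\theta_7}{r_{11}}\right) \tag{18}$$

$$\theta_{10} = \cos^{-1}\!\left(\frac{a_p\,c_{aux} \pm b_p\sqrt{4 r_{10}^2\left(a_p^2 + b_p^2\right) - c_{aux}^2}}{2 r_{10}\left(a_p^2 + b_p^2\right)}\right) \tag{19}$$

$$\theta_{12} = \cos^{-1}\!\left(\frac{r_{11} \cos\theta_{11} + r_{10} \cos\theta_{10} - r_A \cos\theta_8}{r_{12}}\right) \tag{20}$$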

Again, a_g, b_g, a_p, b_p and c_aux are auxiliary variables, also dependent on θ2, introduced in order to reduce the size of the equations.

The next step is to obtain the position of every point of interest, as well as the position of the centres of mass of all links, a trivial task once their angular positions and lengths are known. In addition, taking the first time derivative of these equations gives the angular velocity of every link (and, subsequently, the linear velocity of its centre of mass), and taking the second one gives the angular acceleration (and the linear acceleration of its centre of mass). Therefore, the whole kinematics of the MIMBOT legs is defined.

IV. Parallel Programming, Implementation

As stated in reference [15], parallel computing refers to the process of assigning multiple hardware assets (cores or processors) to simultaneously execute multiple sequences of instructions or to process multiple instances of data. Following Flynn's taxonomy of parallel computers [16], the latter corresponds to the Single Instruction Multiple Data (SIMD) architecture, which refers to applications that execute the same sequence of instructions at the same time on multiple instances of data. The other big group is the Multiple Instruction Multiple Data (MIMD) architecture, in which applications perform multiple sequences of instructions at the same time. Given their main characteristics (thousands of cores working in parallel, coordinated by a scheduling engine, with fast access to data registers), GPUs can be classified within the SIMD architecture. We develop our parallel algorithm based on this architecture.

In a typical PC, the CPU and GPU (and their memories) are physically distinct and connected by a PCI Express bus. In GPGPU, input data must be entered through the CPU and then copied to the GPU, and the same task must be done in the reverse direction with the output data. This is the main disadvantage of GPGPU: the PCIe bus is much slower than the links between the CPU and RAM or between the GPU and its memory, which creates an unavoidable bottleneck. In any case, the programmer must deal with this and allocate sufficient space in both memories to copy the data shared between CPU and GPU. Fortunately, CUDA 6 introduced one of the most dramatic programming model improvements in the history of the platform: Unified Memory (UM). This feature creates a pool of managed memory that is shared between the CPU and GPU and is accessible to both using a single pointer [17], [18].

The key aspect for programmers is that they now see a single pool of memory and do not have to worry about allocating and transferring data between both memories. UM does this work for the programmer, simplifying memory management on the GPU and allowing cleaner and more elegant code. It does not improve the execution speed, but it makes life easier.

Parallelizing the computation of the kinematic model has a series of advantages that are very useful for our purposes. On the one hand, the explicit equations have enough computational cost to justify the programming effort of parallelizing the existing code. On the other hand, the time spent in the computation is short enough to make batteries of tests affordable. In addition, computing those explicit equations gives full control over the tests: this method ensures that all the codes developed solve the same number of equations.

Both sequential and parallel algorithms are developed to compute the kinematic model described above, and they are implemented in C++ and MATLAB. The sequential algorithm follows the flowchart shown in Fig. 6. After the program starts, the link sizes, the input variables, the number of iterations and the time increment for every step are set up. The next step is to choose the kinematic model to solve among the five predefined ones. After that, the computation of the 179 equations defining the model takes place. The C++ version uses a for loop, so after calculating the first step the results obtained are stored in memory and the index N counting the number of iterations is updated until the exit condition is reached. In the MATLAB version we use vectorization, which makes the for loop unnecessary once the vector dimensions are defined. Finally, the results corresponding to the whole simulation are stored in several files and plotted for visual analysis, at which point the program ends. Additionally, given the differences between the programming languages, the memory required for the variables must be allocated in C++, a task that is not needed in MATLAB.

Fig. 6. Flowchart for the sequential algorithm over C++.

The parallel algorithm is also implemented over C++ (with the help of the CUDA toolkit) and MATLAB. The parallelized MATLAB version is almost identical to the sequential one, just adding the required lines for the parallel computation.

CUDA C++, however, requires more work, but it also gives full access to memory and parallel thread management. As a result, the code is more efficient and the execution faster. In this implementation, the “for loop” has been replaced by n threads, as many as the time steps to compute. These threads are launched in parallel on the GPU, so each thread computes one instant of time for all the output data of the subroutine executed.

The flowchart of the parallel algorithm implemented in CUDA C++ is shown in Fig. 7. The first steps are identical for all versions: after starting, the link sizes, the input variables, the number of iterations and the time increment for every step are set up, and the kinematic model is chosen. According to this model, the set of equations to solve is arranged. This is the point where the particularities of the CUDA C++ algorithm begin. First, the attributes of the available GPU are identified, that is, the maximum number of threads per block (NMthread), the maximum number of blocks (NMblock) and the maximum grid size (NMgrid). These are key numbers because they determine the maximum number of threads that can be launched in parallel and how this must be done. Basically, they tell the programmer how many threads per block and how many blocks can be launched. After that, the Unified Memory required to store all the variables that will be used by the GPU is allocated. Next, as many parallel threads as there are iterations are launched in order to compute the kinematics. This is done within the GPU capabilities, splitting the threads into the number of blocks needed so as not to exceed the maximum number of threads per block. The results obtained from the parallel computation are stored and the memories of GPU and CPU are synchronized. Finally, the output data is stored in several plain text files and the results are plotted.
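The splitting of threads into blocks described above is plain integer arithmetic on the host side. The sketch below shows one common way to do it; the names `LaunchConfig` and `makeLaunchConfig` are hypothetical, the limits are passed in as parameters instead of being queried from the GPU, and no CUDA call is made.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical container for a one-dimensional launch configuration.
struct LaunchConfig {
    std::size_t threadsPerBlock;
    std::size_t blocks;
};

// One thread per time step, never exceeding the per-block thread limit.
LaunchConfig makeLaunchConfig(std::size_t nSteps,
                              std::size_t maxThreadsPerBlock,
                              std::size_t maxBlocks) {
    if (nSteps == 0 || maxThreadsPerBlock == 0) {
        throw std::invalid_argument("nSteps and maxThreadsPerBlock must be positive");
    }
    const std::size_t threads =
        nSteps < maxThreadsPerBlock ? nSteps : maxThreadsPerBlock;
    // Ceiling division: enough blocks to cover every time step.
    const std::size_t blocks = (nSteps + threads - 1) / threads;
    if (blocks > maxBlocks) {
        throw std::runtime_error("problem does not fit in a single launch");
    }
    return LaunchConfig{threads, blocks};
}
```

Because the last block is usually only partly used, each thread in a real kernel would also need to check that its global index is smaller than the number of time steps before computing anything.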

Fig 7. Flowchart for the parallelized algorithm over CUDA C++.

V. Numerical Results

We tested both serial and parallelized algorithms in C++ and MATLAB and compared them in terms of computing time. The first column (Size) in Tables II and III shows the number of iterations or time steps computed in every simulation. Four main columns, each divided into two sub-columns, present the time spent in the computation of every simulation and the X-factor. The measured time is only the time spent solving the equations; it does not take into account the time for allocating memory or other tasks. The X-factor is the ratio between the time of any simulation and the time of the CUDA C++ simulation.

We carried out experiments at seven different levels of computational cost for every algorithm and platform, and for the two stabilization systems analyzed, as they have different degrees of complexity. Every experiment consists of a total of 20 tests in order to have enough data for statistical purposes. In total, 560 simulations have been conducted, solving up to more than 3.5 million equations per test.
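As a quick check of the figure quoted above, and assuming that each time step solves the 179 equations of the model, the number of equations per test at the smallest and largest problem sizes is:

$$201 \times 179 = 35\,979, \qquad 20001 \times 179 = 3\,580\,179 \approx 3.58 \times 10^{6}$$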

The test computer has an Intel Xeon E5410 CPU at 2.33 GHz, 6 GB of RAM and an NVIDIA GeForce GTX 660 Ti GPU. The software used was MATLAB R2013b and Visual Studio 2010 Professional with CUDA Toolkit 6.

         CUDA C++              C/C++                 MATLAB CUDA           MATLAB
Size     Time (ms)  X factor   Time (ms)  X factor   Time (s)   X factor   Time (ms)  X factor
201      0.58       1.00       2.25       3.87       3.52       6049       4.56       7.84
601      0.92       1.00       6.85       7.46       9.87       10754      10.73      11.69
1101     1.37       1.00       12.25      8.91       17.97      13077      18.81      13.69
2501     1.38       1.00       22.35      16.17      43.71      31621      34.44      24.91
5001     1.39       1.00       56.00      40.16      101.80     73013      52.82      37.88
10001    2.69       1.00       111.90     41.55      297.06     110292     86.12      31.97
20001    4.03       1.00       232.95     57.74      764.86     189591     171.45     42.50

Table II. Results for the original stabilizer.

         CUDA C++              C/C++                 MATLAB CUDA           MATLAB
Size     Time (ms)  X factor   Time (ms)  X factor   Time (s)   X factor   Time (ms)  X factor
201      0.52       1.00       1.85       3.58       0.61       1188       3.15       6.10
601      0.80       1.00       5.50       6.86       1.36       1689       8.14       10.15
1101     1.19       1.00       10.30      8.65       2.42       2027       14.48      12.15
2501     1.20       1.00       18.85      15.75      5.10       4263       22.82      19.06
5001     1.22       1.00       46.40      38.09      12.10      9935       34.96      28.70
10001    2.33       1.00       91.75      39.38      16.78      7204       54.27      23.29
20001    3.49       1.00       190.65     54.68      33.70      9664       114.01     32.70

Table III. Results for the new stabilizer.

Fig. 8 shows the execution times on a graph. The first point to emphasize is the astonishingly poor performance of the combination of MATLAB and GPU, also called MATLAB CUDA. In theory, executing MATLAB algorithms on the GPU is faster than executing the same algorithm on the CPU. However, our tests show the opposite effect: the GPU takes a huge amount of time to solve the same equations. The MATLAB help contains several examples of parallel computing using GPUs, some of which also show it to be slower than the usual computation. It also notes that "code vectorization" (applied in this case) improves performance, and MathWorks reports better results with parallel computing. Nevertheless, we found that any kind of control statement, such as an "if" or a "for loop", severely degrades performance. In GPU code over MATLAB these statements should be avoided, but there are some cases where this is not possible, with terrible consequences in terms of execution time.

Fig. 8. Time spent to solve the problem (execution time in ms versus problem size).

Leaving this fact aside, it is clearly visible that, as expected, the best performance is achieved with CUDA C++, while the C++ and MATLAB codes take longer to execute the same tasks, and the gain in time grows as the problem size increases.

Another interesting point is that MATLAB is faster than C++ when the number of points is above 5000. The reason could be the vectorization used in MATLAB as opposed to the for loop in C++. MATLAB is designed to work in this way, so this behaviour is expected; perhaps the most surprising result is that it does not perform as well for relatively small amounts of data.

Fig. 9 shows the relative speedups of CUDA C++ over the C++ and MATLAB codes. The results against MATLAB CUDA are not plotted because the values obtained are so poor that they would hinder the analysis of the other data. As expected, and as can be inferred from the tables or Fig. 8, the speedup grows with the size of the problem. The maximum speedups are achieved for the largest problem: around 35x for CUDA C++ vs. MATLAB and around 55x for CUDA C++ vs. C++. The fact mentioned above is clearly visible here: MATLAB is faster than C++ beyond a certain problem size.
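The speedups discussed above can be reproduced from measured times with a few lines of C++. The sketch below assumes the usual definition of relative speedup, baseline time divided by parallel time; the function names `relative_speedup` and `speedup_curve` are hypothetical and introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Relative speedup of a parallel run against a baseline run: how many
// times shorter the parallel execution time is (assumed definition).
inline double relative_speedup(double baseline_ms, double parallel_ms) {
    assert(baseline_ms > 0.0 && parallel_ms > 0.0);
    return baseline_ms / parallel_ms;
}

// Element-wise speedups for two series measured at the same problem sizes.
std::vector<double> speedup_curve(const std::vector<double>& baseline_ms,
                                  const std::vector<double>& parallel_ms) {
    assert(baseline_ms.size() == parallel_ms.size());
    std::vector<double> out;
    out.reserve(baseline_ms.size());
    for (std::size_t i = 0; i < baseline_ms.size(); ++i) {
        out.push_back(relative_speedup(baseline_ms[i], parallel_ms[i]));
    }
    return out;
}
```

With this definition, a value of 55 means that the baseline code needs 55 times as long as the CUDA C++ code to solve the same problem.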

Fig. 9. Relative speedups of CUDA C++ against C++ and MATLAB

Fig. 10. Execution time per iteration for the original stabilization system.

Fig. 11. Execution time per iteration for the new stabilization system.

Fig. 8 axes: time (ms, logarithmic scale) against problem size. Series: CUDA C++, C++, MATLAB CUDA and MATLAB, each for the original and the new stabilizer.

Fig. 9 axes: multiplication factor against problem size (201 to 20001). Series: CUDA C++ vs C++ and CUDA C++ vs MATLAB, each for the original and the new stabilizer.

Figs. 10 and 11 axes: time/size (logarithmic scale) against problem size (201 to 20001). Series: C++, MATLAB, CUDA C++ and MATLAB CUDA. CPU: Intel Xeon E5410; GPU: nVidia GeForce GTX600 Ti.

Analyzing the time spent per iteration (Fig. 10 and Fig. 11), we can draw interesting conclusions about the performance of each algorithm. As the figures show, the CUDA C++ curve slopes steeply downwards: each iteration takes less time as the number of iterations increases. In other words, CUDA C++ becomes more efficient as the problem size grows. For 201 iterations the value achieved is about 0.0027 ms per iteration, while for 20001 it is only 0.0002 ms per iteration, a reduction of more than a factor of 10. For the C++ code, on the other hand, the ratio between time spent and size is almost constant at around 0.01 ms per iteration. The MATLAB version shows a slight downward slope, within the same order of magnitude as C++. Finally, no coherent relation can be established for the MATLAB CUDA code, as it behaves erratically in the situations studied.
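As a quick check of the CUDA C++ figures quoted above, the per-iteration values can be turned back into approximate total times. The symbol t_N, the total time for a problem of N iterations, is notation introduced here and not taken from the tables.

```latex
\frac{t_{201}/201}{t_{20001}/20001}
  \approx \frac{0.0027\ \text{ms}}{0.0002\ \text{ms}} = 13.5,
\qquad
t_{201} \approx 0.0027 \times 201 \approx 0.54\ \text{ms},
\qquad
t_{20001} \approx 0.0002 \times 20001 \approx 4.0\ \text{ms}
```

Under these rounded values, a problem roughly 100 times larger takes only about 7.4 times longer to solve with CUDA C++.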

VI. Conclusions

This work has presented a biped walking robot with a reduced number of DOF. Despite its simplicity, it is able to perform advanced human skills thanks to a set of strategically placed linear actuators. A parametric model describing its kinematics is stated, together with the motion of the linear actuators.

Full control over the tests is achieved by computing explicit equations. This ensures that all codes solve the same number of equations and perform the same trigonometric and exponential operations.

The model has been implemented on four platforms, namely C++, CUDA C++, MATLAB and MATLAB CUDA. The CUDA C++ implementation involves not only memory management, as in C++, but also management of the parallelization process, that is, determining the number and distribution of the parallel threads to be launched. This gives full control over the critical parameters of the parallel computation.
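Choosing the number and distribution of threads usually reduces to picking a block size and deriving the number of blocks by ceiling division. The following host-side C++ sketch shows that arithmetic under assumed values; the block size of 256 and the names `LaunchConfig`, `make_launch_config` and `idle_threads` are hypothetical and do not describe the configuration used in the tests.

```cpp
#include <cassert>
#include <cstddef>

struct LaunchConfig {
    std::size_t threads_per_block;  // threads in each block
    std::size_t blocks;             // number of blocks in the grid
};

// Ceiling division so that blocks * threads_per_block >= n_points,
// i.e. every problem point is covered by at least one thread.
LaunchConfig make_launch_config(std::size_t n_points,
                                std::size_t threads_per_block = 256) {
    assert(n_points > 0 && threads_per_block > 0);
    const std::size_t blocks =
        (n_points + threads_per_block - 1) / threads_per_block;
    return LaunchConfig{threads_per_block, blocks};
}

// Threads whose global index falls beyond the data; a kernel must guard
// against these with a bounds check.
std::size_t idle_threads(const LaunchConfig& cfg, std::size_t n_points) {
    return cfg.blocks * cfg.threads_per_block - n_points;
}
```

For the largest problem size considered, 20001 points, this assumed block size yields 79 blocks with 223 surplus threads, which is why a kernel launched this way normally checks its index against the problem size before writing a result.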

The analysis of the results obtained in our tests shows that the CUDA C++ version is the fastest of all. There are no major differences between the C++ and MATLAB implementations, as their computation times are similar. Finally, solving the problem with MATLAB CUDA is remarkably (and surprisingly) the slowest option of all.

As discussed in the literature, the benefit of parallel computing on GPUs grows with the size of the problem to solve. Proof of this is that the highest speedups are achieved for the largest problem sizes. Likewise, the analysis of the time spent per iteration by the CUDA C++ algorithm shows that it decreases as the problem size grows.

Acknowledgment

The authors wish to thank professors Marco Ceccarelli and Giuseppe Carbone, from the Laboratory of Robotics and Mechatronics at the University of Cassino and South Latium, for their valuable contribution and support to this work.

References

[1] Carbone, G. and Ceccarelli, M. Legged robotic systems. Cutting Edge Robotics. ARS Scientific Book, Wien, pp 553–576, 2005.

[2] Liang, C. and Ceccarelli, M. and Takeda, Y. Operation analysis of a Chebyshev-Pantograph leg mechanism for a single DOF biped robot. Frontiers of Mechanical Engineering. 7 (4), pp 357–370, 2012.

[3] Nava, N. E. and Carbone, G. and Ceccarelli, M. Design Evolution of Low-Cost Humanoid Robot CALUMA. Proceedings 12th IFToMM World Congress, Besançon (France), 2007.

[4] Liang, C. and Ceccarelli, M. and Carbone, G. Design and Simulation of Legged Walking Robots in MATLAB® Environment. In book: Dr. Karel Perutka. MATLAB for Engineers - Applications in Control, Electrical Engineering, IT and Robotics, 2011. ISBN: 978-953-307-914-1

[5] Li, T. and Ceccarelli, M. A leg design for a biped humanoid service robot. Proceedings of CLAWAR 2011: the 14th International Conference on Climbing and Walking Robots, Paris (France), pp. 893-900, 2011.

[6] Escobar, M. E. and Rubio, H. and García-Prada, J.C. Analysis of the Stabilization System of Mimbot Biped. Journal Applied Research and Technology. 10 (2), pp. 206-214, 2012.

[7] Meneses, J. and Castejón, C. and Rubio, H. and García, J.C. Kinematics and dynamics of the quasi-passive biped PASIBOT. Journal of Mechanical Engineering. 57 (12), pp. 879-887, 2011.

[8] Rubio, H. and Bustos, A. and Castejón, C. and Meneses, J. and García-Prada, J.C. Tool for the Analysis of New Skills Biped Pasibot. New Advances in Mechanisms, Transmissions and Applications. Mechanisms and Machine Science, Volume 17, 2014, pp 173-181. Proceedings of the Second Conference MeTrApp 2013. Springer. ISBN 978-94-007-7485-8

[9] Ghorpade, J. and Parande, J. and Kulkarni, M. and Bawaskar, A. GPGPU processing in CUDA architecture. Advanced Computing: An International Journal (ACIJ), Vol.3, No.1, January 2012.

[10] Castonguay, P. Accelerating CFP simulations with GPUs. Clumeq NVIDIA CUDA/GPU workshop. Montreal (Canada), 2012.

[11] Negrut, D. and Tasora, A. and Anitescu, M. and Mazhar, H. and Heyn, T. and Pazouki, A. Solving Large Multibody Dynamics Problems on the GPU. In book: Wen-mei Hwu. GPU Computing Gems. p. 269-280, SAN FRANCISCO: Morgan Kaufmann. 2011. ISBN: 9780123859631.

[12] Heyn, T. and Mazhar, H. and Tasora, A. and Anitescu, M. and Negrut, D. A Parallel Algorithm for Solving Complex Multibody Problems with Stream Processors. Proceedings of ECCOMAS Multibody Dynamics Conference. Warsaw (Poland), 2009.

[13] Tasora A. and Negrut D. A parallel algorithm for solving complex multibody problems with streaming processors. International journal for computational vision and biomechanics, 1(2), pp. 131-143, 2009. ISSN: 0973-6778

[14] Tasora, A. and Anitescu, M. A matrix-free cone complementarity approach for solving large-scale, nonsmooth, rigid body dynamics. Computer Methods in Applied Mechanics and Engineering, 200, pp 439 - 445, 2011.

[15] Negrut, D. and Radu, S. and Hammad, M. et al. Parallel Computing in Multibody System Dynamics: Why, When, and How. Journal of Computational and Nonlinear Dynamics. Volume 9, Issue 4, July 2014.

[16] Flynn, M. J. Some computer organizations and their effectiveness. Computers, IEEE Transactions on, 100(9), pp. 948–960, 1972

[17] Negrut, D. and Serban, R. and Li, A. and Seidl, A. Unified Memory in CUDA 6.0. A Brief Overview of Related Data Access and Transfer Issues. SBEL Technical Report. June 2014.

[18] Harris, M. Unified Memory in CUDA 6. NVIDIA Developer Zone. http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/ November 2013. (Last access: February 2015).