
How to Optimize Data Transfers in CUDA C/C++

Posted on December 4, 2012 by Mark Harris (http://devblogs.nvidia.com/parallelforall/author/mharris/). Tagged: CUDA C/C++, Memory, Profiling.

In the previous (http://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/) three (http://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc/) posts (http://devblogs.nvidia.com/parallelforall/how-query-device-properties-and-handle-errors-cuda-cc/) of this CUDA C & C++ series we laid the groundwork for the major thrust of the series: how to optimize CUDA C/C++ code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050, for example) than the peak bandwidth between host memory and device memory (8 GB/s on PCIe x16 Gen2). This disparity means that your implementation of data transfers between the host and GPU devices can make or break your overall application performance. Let's start with a few general guidelines for host-device data transfers.

- Minimize the amount of data transferred between host and device when possible, even if that means running kernels on the GPU that get little or no speed-up compared to running them on the host CPU.
- Higher bandwidth is possible between the host and the device when using page-locked (or "pinned") memory.
- Batching many small transfers into one larger transfer performs much better because it eliminates most of the per-transfer overhead.
- Data transfers between the host and device can sometimes be overlapped with kernel execution and other data transfers.

We investigate the first three guidelines above in this post, and we dedicate the next post to overlapping data transfers. First I want to talk about how to measure time spent in data transfers without modifying the source code.

Measuring Data Transfer Times with nvprof

To measure the time spent in each data transfer, we could record a CUDA event before and after each transfer and use cudaEventElapsedTime(), as we described in a previous post (http://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc/). However, we can get the elapsed transfer time without instrumenting the source code with CUDA events by using nvprof, a command-line CUDA profiler included with the CUDA Toolkit (starting with CUDA 5). Let's try it out with the following code example, which you can find in the Github repository for this post (https://github.com/parallel-forall/code-samples/blob/master/series/cuda-cpp/optimize-data-transfers/profile.cu).

int main()
{
  const unsigned int N = 1048576;
  const unsigned int bytes = N * sizeof(int);
  int *h_a = (int*)malloc(bytes);
  int *d_a;
  cudaMalloc((int**)&d_a, bytes);

  memset(h_a, 0, bytes);
  cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

  return 0;
}

To profile this code, we just compile it using nvcc, and then run nvprof with the program filename as an argument.


$ nvcc profile.cu -o profile_test
$ nvprof ./profile_test

When I run this on my desktop PC, which has a GeForce GTX 680 (GK104 GPU, similar to a Tesla K10), I get the following output.

$ nvprof ./profile_test
======== NVPROF is profiling profile_test...
======== Command: profile_test
======== Profiling result:
Time(%)      Time   Calls       Avg       Min       Max  Name
  50.08  718.11us      1   718.11us  718.11us  718.11us  [CUDA memcpy DtoH]
  49.92  715.94us      1   715.94us  715.94us  715.94us  [CUDA memcpy HtoD]

As you can see, nvprof measures the time taken by each of the CUDA memcpy calls. It reports the average, minimum, and maximum time for each call (since we only run each copy once, all times are the same). nvprof is quite flexible, so make sure you check out the documentation (http://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview).
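For example, the --print-gpu-trace mode (one of the options covered in the profiler guide) prints one line per kernel launch or memory copy in program order, including per-copy sizes and throughput, rather than the summary shown above:

$ nvprof --print-gpu-trace ./profile_test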

nvprof is new in CUDA 5. If you are using an earlier version of CUDA, you can use the older "command-line profiler", as Greg Ruetsch explained in his post How to Optimize Data Transfers in CUDA Fortran (http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-fortran/).

Minimizing Data Transfers

We should not use only the GPU execution time of a kernel relative to the execution time of its CPU implementation to decide whether to run the GPU or CPU version. We also need to consider the cost of moving data across the PCIe bus, especially when we are initially porting code to CUDA. Because CUDA's heterogeneous programming model uses both the CPU and GPU, code can be ported to CUDA one kernel at a time. In the initial stages of porting, data transfers may dominate the overall execution time. It's worthwhile to keep tabs on time spent on data transfers separately from time spent in kernel execution. It's easy to use the command-line profiler for this, as we already demonstrated. As we port more of our code, we'll remove intermediate transfers and decrease the overall execution time correspondingly.
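To make this concrete, here is a minimal back-of-envelope sketch (not from the original post; every number below is an illustrative placeholder you would replace with your own nvprof or CUDA event measurements). It shows why a kernel with a healthy raw speed-up can still lose once round-trip transfer time is counted.

#include <stdio.h>

int main()
{
  // Illustrative numbers only: measure your own with nvprof or CUDA events.
  const double bytesEachWay = 16.0 * 1024 * 1024; // data copied each way (bytes)
  const double pcieBW       = 6.0e9;              // ~pinned PCIe Gen2 x16 (bytes/s)
  const double cpuTime      = 10.0e-3;            // CPU implementation runtime (s)
  const double kernelTime   = 1.0e-3;             // GPU kernel runtime (s)

  const double transferTime = 2.0 * bytesEachWay / pcieBW; // H2D + D2H
  const double gpuTotal     = kernelTime + transferTime;

  printf("GPU total: %.2f ms (kernel %.2f + transfers %.2f) vs CPU: %.2f ms\n",
         gpuTotal * 1e3, kernelTime * 1e3, transferTime * 1e3, cpuTime * 1e3);
  return 0;
}

With these particular numbers the transfers alone cost about 5.6 ms, turning a 10x kernel speed-up into roughly a 1.5x application speed-up.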

Pinned Host Memory

Host (CPU) data allocations are pageable by default. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked, or "pinned", host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory, as illustrated below.

[Figure: pageable vs. pinned data transfers. A pageable transfer is staged through a temporary pinned buffer in host memory; a pinned transfer moves data directly between the pinned host array and device memory.]

As you can see in the figure, pinned memory is used as a staging area for transfers between the device and the host. We can avoid the cost of the transfer between pageable and pinned host arrays by directly allocating our host arrays in pinned memory. Allocate pinned host memory in CUDA C/C++ using cudaMallocHost() (http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__MEMORY_1g9f93d9600f4504e0d637ceb43c91ebad) or cudaHostAlloc() (http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__MEMORY_1g15a3871f15f8c38f5b7190946845758c), and deallocate it with cudaFreeHost() (http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__MEMORY_1gedaeb2708ad3f74d5b417ee1874ec84a). It is possible for pinned memory allocation to fail, so you should always check for errors. The following code excerpt demonstrates allocation of pinned memory with error checking.

cudaError_t status = cudaMallocHost((void**)&h_aPinned, bytes);
if (status != cudaSuccess)
  printf("Error allocating pinned host memory\n");

Data transfers using host pinned memory use the same cudaMemcpy() (http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__MEMORY_1g48efa06b81cc031b2aa6fdc2e9930741) syntax as transfers with pageable memory. We can use the following "bandwidthtest" program (also available on Github (https://github.com/parallel-forall/code-samples/blob/master/series/cuda-cpp/optimize-data-transfers/bandwidthtest.cu)) to compare pageable and pinned transfer rates.


#include <stdio.h>
#include <assert.h>

// Convenience function for checking CUDA runtime API results
// can be wrapped around any runtime API call. No-op in release builds.
inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n",
            cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
#endif
  return result;
}

void profileCopies(float *h_a,
                   float *h_b,
                   float *d,
                   unsigned int n,
                   const char *desc)
{
  printf("\n%s transfers\n", desc);

  unsigned int bytes = n * sizeof(float);

  // events for timing
  cudaEvent_t startEvent, stopEvent;

  checkCuda(cudaEventCreate(&startEvent));
  checkCuda(cudaEventCreate(&stopEvent));

  checkCuda(cudaEventRecord(startEvent, 0));
  checkCuda(cudaMemcpy(d, h_a, bytes, cudaMemcpyHostToDevice));
  checkCuda(cudaEventRecord(stopEvent, 0));
  checkCuda(cudaEventSynchronize(stopEvent));

  float time;
  checkCuda(cudaEventElapsedTime(&time, startEvent, stopEvent));
  printf("  Host to Device bandwidth (GB/s): %f\n", bytes * 1e-6 / time);

  checkCuda(cudaEventRecord(startEvent, 0));
  checkCuda(cudaMemcpy(h_b, d, bytes, cudaMemcpyDeviceToHost));
  checkCuda(cudaEventRecord(stopEvent, 0));
  checkCuda(cudaEventSynchronize(stopEvent));

  checkCuda(cudaEventElapsedTime(&time, startEvent, stopEvent));
  printf("  Device to Host bandwidth (GB/s): %f\n", bytes * 1e-6 / time);

  for (int i = 0; i < n; ++i) {
    if (h_a[i] != h_b[i]) {
      printf("*** %s transfers failed ***\n", desc);
      break;
    }
  }

  // clean up events
  checkCuda(cudaEventDestroy(startEvent));
  checkCuda(cudaEventDestroy(stopEvent));
}

int main()
{
  unsigned int nElements = 4 * 1024 * 1024;
  const unsigned int bytes = nElements * sizeof(float);

  // host arrays
  float *h_aPageable, *h_bPageable;
  float *h_aPinned, *h_bPinned;

  // device array
  float *d_a;

  // allocate and initialize
  h_aPageable = (float*)malloc(bytes);                  // host pageable
  h_bPageable = (float*)malloc(bytes);                  // host pageable
  checkCuda(cudaMallocHost((void**)&h_aPinned, bytes)); // host pinned
  checkCuda(cudaMallocHost((void**)&h_bPinned, bytes)); // host pinned
  checkCuda(cudaMalloc((void**)&d_a, bytes));           // device

  for (int i = 0; i < nElements; ++i) h_aPageable[i] = i;
  memcpy(h_aPinned, h_aPageable, bytes);
  memset(h_bPageable, 0, bytes);
  memset(h_bPinned, 0, bytes);

  printf("\nTransfer size (MB): %d\n", bytes / (1024 * 1024));

  // perform copies and report bandwidth
  profileCopies(h_aPageable, h_bPageable, d_a, nElements, "Pageable");
  profileCopies(h_aPinned, h_bPinned, d_a, nElements, "Pinned");

  printf("\n");

  // cleanup
  cudaFree(d_a);
  cudaFreeHost(h_aPinned);
  cudaFreeHost(h_bPinned);
  free(h_aPageable);
  free(h_bPageable);

  return 0;
}

The data transfer rate can depend on the type of host system (motherboard, CPU, and chipset) as well as the GPU. On my laptop, which has an Intel Core i7-2620M CPU (2.7 GHz, 2 Sandy Bridge cores, 4 MB L3 cache) and an NVIDIA NVS 4200M GPU (1 Fermi SM, Compute Capability 2.1, PCIe Gen2 x16), running BandwidthTest produces the following results. As you can see, pinned transfers are more than twice as fast as pageable transfers.

Device: NVS 4200M
Transfer size (MB): 16

Pageable transfers
  Host to Device bandwidth (GB/s): 2.308439
  Device to Host bandwidth (GB/s): 2.316220

Pinned transfers
  Host to Device bandwidth (GB/s): 5.774224
  Device to Host bandwidth (GB/s): 5.958834

On my desktop PC with a much faster Intel Core i7-3930K CPU (3.2 GHz, 6 Sandy Bridge cores, 12 MB L3 cache) and an NVIDIA GeForce GTX 680 GPU (8 Kepler SMs, Compute Capability 3.0), we see much faster pageable transfers, as the following output shows. This is presumably because the faster CPU (and chipset) reduces the host-side memory copy cost.

Device: GeForce GTX 680
Transfer size (MB): 16

Pageable transfers
  Host to Device bandwidth (GB/s): 5.368503
  Device to Host bandwidth (GB/s): 5.627219

Pinned transfers
  Host to Device bandwidth (GB/s): 6.186581
  Device to Host bandwidth (GB/s): 6.670246

You should not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system and other programs. How much is too much is difficult to tell in advance, so as with all optimizations, test your applications and the systems they run on for optimal performance parameters.

Batching Small Transfers

Due to the overhead associated with each transfer, it is preferable to batch many small transfers together into a single transfer. This is easy to do by using a temporary array, preferably pinned, and packing it with the data to be transferred, as in the sketch below.
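Here is a minimal sketch of that packing pattern, reusing the checkCuda() helper and includes from the program above. The names (smallHostArrays, h_staging, d_batch) and the batch dimensions are made up for illustration; the point is the single large cudaMemcpy replacing many small ones.

// Batch 64 small arrays (256 floats each) into one pinned staging buffer,
// then issue a single large cudaMemcpy instead of 64 small ones.
const unsigned int nArrays = 64, nPer = 256;
const unsigned int batchBytes = nArrays * nPer * sizeof(float);

float *smallHostArrays[nArrays];  // stand-ins for existing small host arrays
for (unsigned int i = 0; i < nArrays; ++i)
  smallHostArrays[i] = (float*)malloc(nPer * sizeof(float));

float *h_staging, *d_batch;
checkCuda(cudaMallocHost((void**)&h_staging, batchBytes)); // pinned staging area
checkCuda(cudaMalloc((void**)&d_batch, batchBytes));

// pack: copy each small array into its slot in the staging buffer
for (unsigned int i = 0; i < nArrays; ++i)
  memcpy(h_staging + i * nPer, smallHostArrays[i], nPer * sizeof(float));

// one large transfer replaces nArrays small ones
checkCuda(cudaMemcpy(d_batch, h_staging, batchBytes, cudaMemcpyHostToDevice));

The host-side memcpy packing is cheap relative to the per-transfer overhead it eliminates, especially when the individual pieces are only a few kilobytes each.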

For two-dimensional array transfers, you can use cudaMemcpy2D() (http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__MEMORY_1g17f3a55e8c9aef5f90b67cdf22851375).

cudaMemcpy2D(dest, dest_pitch, src, src_pitch, w, h, cudaMemcpyHostToDevice);

The arguments here are a pointer to the first destination element and the pitch of the destination array, a pointer to the first source element and the pitch of the source array, the width and height of the submatrix to transfer, and the memcpy kind. There is also a cudaMemcpy3D() (http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__MEMORY_1gc1372614eb614f4689fbb82b4692d30a) function for transfers of rank-three array sections.
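As a concrete illustration, here is a hedged sketch of copying a submatrix out of a wider host matrix with cudaMemcpy2D(). The matrix dimensions and names (h_mat, d_sub) are made up; note that the pitches and the width are given in bytes, while the height is a row count.

// Copy a 256x256 float submatrix out of a 1024-column host matrix
// into a tightly packed device allocation.
const size_t srcCols = 1024, w = 256, h = 256;
float *h_mat = (float*)malloc(srcCols * h * sizeof(float));
float *d_sub;
checkCuda(cudaMalloc((void**)&d_sub, w * h * sizeof(float)));

checkCuda(cudaMemcpy2D(d_sub, w * sizeof(float),        // dest, dest pitch (bytes)
                       h_mat, srcCols * sizeof(float),  // src, src pitch (bytes)
                       w * sizeof(float), h,            // width (bytes), height (rows)
                       cudaMemcpyHostToDevice));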

Summary


Transfers between the host and device are the slowest link of data movement involved in GPU computing, so you should take care to minimize transfers. Following the guidelines in this post can help you make sure necessary transfers are efficient. When you are porting or writing new CUDA C/C++ code, I recommend that you start with pageable transfers from existing host pointers. As I mentioned earlier, as you write more device code you will eliminate some of the intermediate transfers, so any effort you spend optimizing transfers early in porting may be wasted. Also, rather than instrumenting code with CUDA events or other timers to measure time spent for each transfer, I recommend that you use nvprof, the command-line CUDA profiler, or one of the visual profiling tools such as the NVIDIA Visual Profiler (also included with the CUDA Toolkit).

This post focused on making data transfers efficient. In the next post (http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/), we discuss how you can overlap data transfers with computation and with other data transfers.

∥∀


About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work.

Follow @harrism on Twitter (https://twitter.com/intent/user?screen_name=harrism). View all posts by Mark Harris → (http://devblogs.nvidia.com/parallelforall/author/mharris/)