rCUDA: a ready-to-use remote GPU virtualization framework
Rafael Mayo, Universitat Jaume I de Castelló (Spain)
Joint research effort
Outline
GPU computing
rCUDA framework
Other GPU virtualization frameworks
rCUDA SLURM integration
SDK and real applications tests
HPC Advisory Council European Conference, Leipzig 2013
GPU computing
GPU computing: covers all the technological issues involved in using the GPU's computational power to execute general-purpose code.
GPU computing has experienced remarkable growth in the last years.
Top500, November 2012
GPU computing
The basic building block is a node with one or more GPUs.
[Figure: a node with CPU and main memory connected through PCI-e to several GPUs, each with its own GPU memory, plus a network interface.]
GPU computing
From the programming point of view: a set of nodes, each one with:
one or more CPUs (with several cores per CPU)
one or more GPUs (1-4)
disjoint memory spaces for each GPU and CPU
An interconnection network
[Figure: several such nodes, each with CPU, main memory and PCI-e-attached GPUs, linked by an interconnection network.]
GPU computing
Development tools have been introduced in order to ease the programming of GPUs.
Two main approaches in GPU computing development environments:
CUDA: NVIDIA proprietary
OpenCL: open standard
GPU computing
Basically, CUDA and OpenCL have the same working scheme:
Compilation: separate the CPU code from the GPU code (GPU kernel)
Execution: data transfers between the CPU and GPU memory spaces
1. Before GPU kernel execution: data from CPU memory space to GPU memory space
2. Computation: kernel execution
3. After GPU kernel execution: results from GPU memory space to CPU memory space
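The three execution steps above can be sketched with the CUDA runtime API. This is a minimal sketch: the kernel `scale` and the launch configuration are illustrative, not part of the original slides.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each thread scales one element of the vector.
__global__ void scale(float *v, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= a;
}

void run_on_gpu(float *host, int n) {
    float *dev;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dev, bytes);                               // allocate GPU memory
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // 1. input: CPU -> GPU
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);         // 2. kernel execution
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  // 3. results: GPU -> CPU
    cudaFree(dev);
}
```

Under rCUDA the very same code runs unmodified: the runtime calls are intercepted and forwarded to a remote GPU.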
GPU computing
Time spent on data transfers may not be negligible.
[Figure: influence of data transfers for SGEMM; percentage of time devoted to data transfers vs. matrix size (0 to 18000), for pinned and non-pinned memory.]
GPU computing
For the right kind of code, the use of GPUs brings huge benefits in terms of performance and energy.
There must be data parallelism in the code: this is the only way to benefit from the hundreds of processors in a GPU.
Different scenarios from the point of view of the application:
Low level of data parallelism
High level of data parallelism
Moderate level of data parallelism
Applications for multi-GPU computing
rCUDA framework
rCUDA: a framework enabling a CUDA-based application running in one node to access GPUs in other nodes.
It is useful when you have:
Moderate level of data parallelism
Applications for multi-GPU computing
rCUDA framework
Virtualized remote GPUs
[Figure: nodes with physical GPUs linked by an interconnection network; each node sees a pool of virtual GPUs backed by the remote physical GPUs.]
rCUDA framework
CUDA application: client side / server side
[Figure, built up over several slides: locally, the Application sits on top of the CUDA libraries and the CUDA driver + runtime. With rCUDA, the client-side Application instead links against the rCUDA library, which forwards calls through the network device to the server side, where the rCUDA daemon uses the CUDA driver + runtime to execute them on the GPU.]
rCUDA framework
rCUDA uses a proprietary communication protocol.
Example:
1) initialization
2) memory allocation on the remote GPU
3) CPU to GPU memory transfer of the input data
4) kernel execution
5) GPU to CPU memory transfer of the results
6) GPU memory release
7) communication channel closing and server process finalization
rCUDA framework
[Figure: on the client side, the Application calls the interception library, which forwards CUDA calls through the communications layer (TCP or InfiniBand) and the network interface; on the server side, the server daemon receives them through its communications layer and invokes the CUDA library.]
rCUDA framework
Optimized communications
Use GPUDirect to avoid memory copies
Pipeline to improve performance
Preallocated pinned memory buffers
Optimal pipeline block size
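The pipelining idea can be sketched in plain C. This mock is entirely illustrative: the network receive and the GPU copy are stand-ins implemented with memcpy, and two preallocated staging buffers play the role of the pinned buffers; in the real rCUDA pipeline the two stages overlap instead of running back to back.

```c
#include <assert.h>
#include <string.h>

#define BLOCK 4096   /* pipeline block size (illustrative value) */

/* Mocked stages: in rCUDA these would be a network receive and a DMA to the GPU. */
static void net_recv(char *dst, const char *src, size_t n) { memcpy(dst, src, n); }
static void gpu_copy(char *dst, const char *src, size_t n) { memcpy(dst, src, n); }

/* Move n bytes from 'remote' to 'gpu' through two preallocated staging buffers,
   alternating buffers so that one can be filled while the other drains. */
void pipelined_transfer(char *gpu, const char *remote, size_t n) {
    static char stage[2][BLOCK];   /* preallocated "pinned" buffers */
    size_t off = 0;
    int cur = 0;
    while (off < n) {
        size_t chunk = n - off < BLOCK ? n - off : BLOCK;
        net_recv(stage[cur], remote + off, chunk);  /* network -> staging buffer */
        gpu_copy(gpu + off, stage[cur], chunk);     /* staging buffer -> GPU */
        off += chunk;
        cur ^= 1;                                   /* alternate buffers */
    }
}
```

Preallocating the buffers avoids the cost of registering pinned memory on every transfer, and the block size trades pipeline startup latency against per-block overhead (hence the block-size study on the following slides).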
rCUDA framework
Efficient pipeline implementation: synchronous transfers, pageable memory
[Figure, animated over several slides: for each block, the client copies data from pageable main memory into a preallocated pinned buffer, requests and data are exchanged over the network (SEND/RECV), and the server moves the block from its pinned buffer on to the GPU; requests and block transfers alternate through the pipeline.]
rCUDA framework
Efficient pipeline implementation: asynchronous transfers, pinned memory
[Figure, animated over several slides: with pinned memory the network adapter reads the application buffers directly (RDMA READ), so blocks flow from client main memory across the network and on to the GPU without intermediate copies; requests (SEND/RECV) and RDMA reads proceed asynchronously, block by block.]
rCUDA framework
Pipeline block size for InfiniBand FDR
• NVIDIA Tesla K20
• Mellanox ConnectX-3 + SX6025 Mellanox switch
rCUDA framework
Host to device copy with pinned memory
rCUDA framework
Host to device copy with pageable memory
rCUDA framework
Copy a small dataset for latency study (64 bytes)
GPU virtualization
Several efforts have been made regarding GPU virtualization in recent years:
rCUDA (CUDA 5.0)
GVirtuS (CUDA 3.2)
DS-CUDA (CUDA 4.1)
vCUDA (CUDA 1.1)
GViM (CUDA 1.1)
GridCUDA (CUDA 2.3)
V-GPU (CUDA 4.0)
GPU virtualization
Comparison of virtualization solutions: latency (transfer of 64 bytes)
Test system: Intel Xeon E5-2620 (6 cores) 2.0 GHz, NVIDIA Tesla K20 GPU, Mellanox ConnectX-3 single-port InfiniBand adapter (FDR), CentOS 6.3 + Mellanox OFED 1.5.3

          Pageable H2D   Pinned H2D   Pageable D2H   Pinned D2H
CUDA          34.3           4.3          16.2           5.2
rCUDA         94.5          23.1         292.2           6.0
GVirtuS      184.2         200.3         168.4          182.8
DS-CUDA       45.9            -           26.5            -
GPU virtualization
• Bandwidth host to device using pageable memory
[Figure: bandwidth (MB/s) vs. transfer size (MB).]
GPU virtualization
• Bandwidth device to host using pageable memory
[Figure: bandwidth (MB/s) vs. transfer size (MB).]
GPU virtualization
• Bandwidth host to device using pinned memory
[Figure: bandwidth (MB/s) vs. transfer size (MB).]
GPU virtualization
• Bandwidth device to host using pinned memory
[Figure: bandwidth (MB/s) vs. transfer size (MB).]
SLURM integration
• SLURM job scheduler: it does not know about virtualized GPUs.
• A new GRES (generic resource) is added in order to manage the virtualized GPUs.
• Where the GPUs are in the system is completely transparent to the user.
• In the job script or in the submission command, the user specifies the number of rGPUs for the job.
SLURM integration
slurm.conf

ClusterName=rcu
…
GresTypes=gpu,rgpu
…
NodeName=rcu16 NodeAddr=rcu16 CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7990 Gres=rgpu:4,gpu:4
SLURM integration
scontrol show node

NodeName=rcu16 Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null)
Gres=rgpu:4,gpu:4
NodeAddr=rcu16 NodeHostName=rcu16
OS=Linux RealMemory=7682 Sockets=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
BootTime=2013-04-24T18:45:35
SlurmdStartTime=2013-04-30T10:02:04
SLURM integration
gres.conf

Name=rgpu File=/dev/nvidia0 Cuda=2.1 Mem=1535m
Name=rgpu File=/dev/nvidia1 Cuda=3.5 Mem=1535m
Name=rgpu File=/dev/nvidia2 Cuda=1.2 Mem=1535m
Name=rgpu File=/dev/nvidia3 Cuda=1.2 Mem=1535m
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
SLURM integration
Submit a job:

$ srun -N1 --gres=rgpu:4:512M script.sh
...
$

• Environment variables are initialized by SLURM and used by the rCUDA client (transparently to the user):

RCUDA_DEVICE_COUNT=4
RCUDA_DEVICE_0=rcu16:0    (server name/IP address : GPU)
RCUDA_DEVICE_1=rcu16:1
RCUDA_DEVICE_2=rcu16:2
RCUDA_DEVICE_3=rcu16:3
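Each RCUDA_DEVICE_n value above has the form server:gpu. A minimal sketch in C of how a client could split such a value (the function is ours, for illustration only, not part of the rCUDA API):

```c
#include <stdio.h>
#include <string.h>

/* Split an RCUDA_DEVICE_n value ("server:gpu", e.g. "rcu16:0") into the
   server host name and the GPU index. Returns 0 on success, -1 if the
   value is malformed or the host buffer is too small. */
int parse_rcuda_device(const char *value, char *host, size_t hostlen, int *gpu) {
    const char *colon = strchr(value, ':');
    if (!colon || (size_t)(colon - value) >= hostlen) return -1;
    memcpy(host, value, colon - value);   /* copy the server name part */
    host[colon - value] = '\0';
    return sscanf(colon + 1, "%d", gpu) == 1 ? 0 : -1;
}
```

With the four variables of the example, a client would resolve devices 0-3 to GPUs 0-3 of server rcu16.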
SLURM integration
[Figure: with rCUDA, GPUs are decoupled from nodes.]
rCUDA test
Test system:
Intel Xeon E5-2620 (6 cores) 2.0 GHz
NVIDIA Tesla K20 GPU
Mellanox ConnectX-3 single-port InfiniBand adapter (FDR)
Mellanox switch SX6025 (FDR)
Cisco switch SLM2014 (1 Gbps Ethernet)
CentOS 6.3 + Mellanox OFED 1.5.3
rCUDA test
Test: CUDA SDK examples
[Figure: execution time of a set of CUDA SDK examples normalized to local CUDA, for rCUDA over FDR InfiniBand and rCUDA over Gigabit Ethernet.]
rCUDA test
CUDASW++: bioinformatics software for Smith-Waterman protein database searches
(Please, don't ask me about this)
[Figure: execution time and rCUDA overhead (%) vs. sequence length, for CUDA and for rCUDA over FDR, QDR and GbE.]
rCUDA test
CUDA-BLAST: accelerated version of NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool
(Please, don't ask me about this)
[Figure: execution time and rCUDA overhead (%) vs. sequence length, for CUDA and for rCUDA over FDR, QDR and GbE.]
rCUDA test
Test system for multi-GPU:
For CUDA tests, one node with:
2 Quad-Core Intel Xeon E5440 processors
Tesla S2050 (4 Tesla GPUs)
Each thread (1-4) uses one GPU
For rCUDA tests, 8 nodes with:
2 Quad-Core Intel Xeon E5520 processors
1 NVIDIA Tesla C2050
InfiniBand QDR
Test running in one node and using up to all the GPUs of the others
Each thread (1-8) uses one GPU
rCUDA test
MonteCarlo MultiGPU (from NVIDIA SDK)
rCUDA test
GEMM multi-GPU using libflame, a high-performance dense linear algebra library
http://www.cs.utexas.edu/~flame/web/
rCUDA work in progress
Scheduling in SLURM taking into account the actual GPU utilization
Tests on ARM platforms:
CARMA board as client and rCUDA daemon on an Intel platform equipped with an NVIDIA Tesla Fermi
Functional tests with the NVIDIA SDK and LAMMPS
Limited performance due to the 1 Gbps Ethernet network
Test more applications from "Popular GPU-accelerated Applications"
Support for CUDA 5.5 (release candidate)
http://www.rcuda.net
More than 300 requests
Thanks to Mellanox for its support to this work

rCUDA people:
Antonio Peña (1)
Carlos Reaño
Federico Silla
José Duato
Adrian Castelló
Rafael Mayo
Enrique S. Quintana-Ortí

(1) Now at Argonne National Lab. (USA)