Transcript of GTC On Demand | NVIDIA GTC Digital: GPU-InfiniBand Accelerations for Hybrid Compute Systems (Pak Lui, Mellanox Technologies, GTC, March 2013)

  • Pak Lui

    March 2013, GTC

    GPU-InfiniBand Accelerations for Hybrid Compute Systems

  • © 2013 Mellanox Technologies (slide 2)

    Leading Supplier of End-to-End Interconnect Solutions

    Product lines: ICs, adapter cards, switches/gateways, cables, and host/fabric software

    Comprehensive End-to-End InfiniBand and Ethernet Portfolio

    [Diagram: Virtual Protocol Interconnect linking server/compute, switch/gateway, and storage front/back-end over 56G InfiniBand, 56G IB & FCoIB, 10/40/56GbE, 10/40/56GbE & FCoE, and Fibre Channel]

  • © 2013 Mellanox Technologies (slide 3)

    Virtual Protocol Interconnect (VPI) Technology

    VPI Adapter (adapter card, LOM, mezzanine card)
    • InfiniBand: 10/20/40/56 Gb/s
    • Ethernet: 10/40/56 Gb/s
    • Acceleration engines serving networking, storage, clustering, and management applications

    VPI Switch (switch OS layer, Unified Fabric Manager)
    • 64 ports 10GbE
    • 36 ports 40/56GbE
    • 48 ports 10GbE + 12 ports 40/56GbE
    • 36 ports InfiniBand up to 56Gb/s
    • 8 VPI subnets

    From data center to campus and metro connectivity

  • © 2013 Mellanox Technologies (slide 4)

    Connect-IB: The Exascale Foundation

    A new interconnect architecture for compute-intensive applications

    World's fastest server and storage interconnect solution

    Enables unlimited clustering (compute and storage) scalability

    Accelerates compute-intensive and parallel-intensive applications

    Optimized for multi-tenant environments with hundreds of virtual machines per server

  • © 2013 Mellanox Technologies (slide 5)

    Connect-IB Performance Highlights

    Unparalleled throughput and message injection rates

    World's first 100Gb/s InfiniBand interconnect adapter
    • PCIe 3.0 x16, dual FDR 56Gb/s InfiniBand ports to provide >100Gb/s

    Highest InfiniBand message rate: 137 million messages per second
    • 4X higher than other InfiniBand solutions

  • © 2013 Mellanox Technologies (slide 6)

    GPUDirect History

    The GPUDirect project was announced in November 2009
    • "NVIDIA Tesla GPUs To Communicate Faster Over Mellanox InfiniBand Networks", http://www.nvidia.com/object/io_1258539409179.html

    GPUDirect was developed jointly by Mellanox and NVIDIA
    • New interface (API) within the Tesla GPU driver
    • New interface within the Mellanox InfiniBand drivers
    • Linux kernel modification to allow direct communication between the drivers

    GPUDirect 1.0 was announced in Q2'10
    • "Mellanox Scalable HPC Solutions with NVIDIA GPUDirect Technology Enhance GPU-Based HPC Performance and Efficiency"
    • "Mellanox was the lead partner in the development of NVIDIA GPUDirect"

    GPUDirect RDMA will be released in Q2'13

  • © 2013 Mellanox Technologies (slide 7)

    GPU-InfiniBand Bottleneck (pre-GPUDirect)

    GPU communication uses "pinned" buffers for data movement
    • A section of host memory dedicated to the GPU
    • Allows optimizations such as write-combining and overlapping GPU computation with data transfer for best performance

    InfiniBand uses "pinned" buffers for efficient RDMA transactions
    • Zero-copy data transfers, kernel bypass
    • Reduces CPU overhead

    [Diagram: data path between GPU memory and the InfiniBand adapter via the chipset, CPU, and system memory; because the GPU and InfiniBand pinned buffers are separate, data must be copied between them in host memory]
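    As a rough illustration of this bottleneck, the following sketch (not from the presentation) shows a pre-GPUDirect send path with the CUDA runtime and InfiniBand verbs: the CUDA driver and the InfiniBand driver each pin their own host buffer, so the CPU has to copy data between the two pinned regions before the HCA can send it. Function and buffer names, sizes, and the omitted queue-pair setup are illustrative.

      #include <cuda_runtime.h>
      #include <infiniband/verbs.h>
      #include <stdlib.h>
      #include <string.h>

      /* Pre-GPUDirect send path (sketch): GPU -> CUDA pinned host buffer ->
       * CPU memcpy -> InfiniBand pinned host buffer -> HCA. */
      void send_from_gpu(struct ibv_pd *pd, const void *d_src, size_t len)
      {
          void *cuda_pinned;                  /* host buffer pinned by the CUDA driver */
          cudaHostAlloc(&cuda_pinned, len, cudaHostAllocDefault);

          void *ib_pinned = malloc(len);      /* host buffer pinned by the IB driver */
          struct ibv_mr *mr = ibv_reg_mr(pd, ib_pinned, len, IBV_ACCESS_LOCAL_WRITE);

          cudaMemcpy(cuda_pinned, d_src, len, cudaMemcpyDeviceToHost); /* step 1: GPU -> host */
          memcpy(ib_pinned, cuda_pinned, len);                         /* step 2: the extra CPU copy */
          /* step 3: post ib_pinned with ibv_post_send (QP setup omitted) */

          ibv_dereg_mr(mr);
          free(ib_pinned);
          cudaFreeHost(cuda_pinned);
      }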

  • © 2013 Mellanox Technologies (slide 8)

    GPUDirect 1.0

    [Diagrams: transmit and receive data paths (CPU, chipset, GPU memory, system memory, InfiniBand) comparing non-GPUDirect with GPUDirect 1.0; without GPUDirect the data makes two passes through system memory (steps 1 and 2), with GPUDirect 1.0 only one (step 1)]
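    A minimal sketch of what GPUDirect 1.0 changes from the application's point of view, assuming a GPUDirect-enabled kernel and drivers: the same pinned host buffer is shared by the CUDA and InfiniBand drivers, so the staging memcpy from the previous sketch disappears. Names and the omitted queue-pair setup are illustrative.

      #include <cuda_runtime.h>
      #include <infiniband/verbs.h>

      /* GPUDirect 1.0 send path (sketch): one shared pinned host buffer,
       * no second host-to-host copy. */
      void send_from_gpu_gd1(struct ibv_pd *pd, const void *d_src, size_t len)
      {
          void *host_buf;
          cudaHostAlloc(&host_buf, len, cudaHostAllocDefault);   /* pinned for CUDA */
          struct ibv_mr *mr = ibv_reg_mr(pd, host_buf, len,      /* registered for IB */
                                         IBV_ACCESS_LOCAL_WRITE);

          cudaMemcpy(host_buf, d_src, len, cudaMemcpyDeviceToHost);
          /* host_buf can now be posted with ibv_post_send directly (QP setup omitted) */

          ibv_dereg_mr(mr);
          cudaFreeHost(host_buf);
      }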

  • © 2013 Mellanox Technologies (slide 9)

    GPUDirect 1.0 – Application Performance

    LAMMPS
    • 3 nodes, 10% gain

    Amber – Cellulose
    • 8 nodes, 32% gain

    Amber – FactorIX
    • 8 nodes, 27% gain

    [Charts: 3 nodes with 1 GPU per node; 3 nodes with 3 GPUs per node]

  • © 2013 Mellanox Technologies (slide 10)

    GPUDirect RDMA

    [Diagrams: transmit and receive data paths (CPU, chipset, GPU memory, system memory, InfiniBand) comparing GPUDirect 1.0 with GPUDirect RDMA; with GPUDirect RDMA the InfiniBand adapter accesses GPU memory directly, without staging through system memory]
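    At the application level, GPUDirect RDMA is typically consumed through a CUDA-aware MPI library such as MVAPICH2: the program passes device pointers straight to MPI calls, and the library (with the GPUDirect RDMA driver) lets the HCA read or write GPU memory without host staging. A minimal sketch, assuming a CUDA-aware MPI build; message size, ranks, and tag are illustrative.

      #include <mpi.h>
      #include <cuda_runtime.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          const int n = 1 << 20;                      /* illustrative element count */
          float *d_buf;
          cudaMalloc((void **)&d_buf, n * sizeof(float));

          if (rank == 0)                              /* device pointers go straight to MPI */
              MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
          else if (rank == 1)
              MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          cudaFree(d_buf);
          MPI_Finalize();
          return 0;
      }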

  • Initial Design of OSU-MVAPICH2 with GPU-Direct-RDMA

    • A preliminary driver for GPUDirect RDMA is being developed by NVIDIA and Mellanox

    • OSU has completed an initial design of MVAPICH2 with the latest GPU-Direct-RDMA driver
      – Hybrid design (sketched below)
      – Takes advantage of GPU-Direct-RDMA for short messages
      – Uses the host-based buffered design of current MVAPICH2 for large messages
      – Alleviates the Sandy Bridge chipset bottleneck

    DK-OSU-MVAPICH2-GPU-Direct-RDMA 11
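    The hybrid design above can be pictured as a simple protocol switch on message size. The sketch below is not the MVAPICH2 source; the threshold, staging-buffer handling, and the post_rdma_send helper are hypothetical stand-ins for the library's internal send path. Small messages are sent straight from GPU memory via GPUDirect RDMA, while large messages are staged through a pinned host buffer to work around the Sandy Bridge chipset bottleneck noted above.

      #include <cuda_runtime.h>
      #include <stddef.h>

      #define GDR_THRESHOLD (8 * 1024)   /* hypothetical cutoff in bytes */

      /* Hybrid send (sketch): GPUDirect RDMA for short messages,
       * host-staged copy for large ones. */
      void hybrid_send(const void *d_buf, size_t len,
                       void *pinned_host_staging,
                       void (*post_rdma_send)(const void *, size_t))
      {
          if (len <= GDR_THRESHOLD) {
              post_rdma_send(d_buf, len);             /* HCA reads GPU memory directly */
          } else {
              cudaMemcpy(pinned_host_staging, d_buf, len, cudaMemcpyDeviceToHost);
              post_rdma_send(pinned_host_staging, len);
          }
      }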

  • DK-OSU-MVAPICH2-GPU-Direct-RDMA 12

    Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

    GPU-GPU Internode MPI Latency

    Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) nodes with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

    [Charts: Small Message Latency (1 B – 4 KB) and Large Message Latency (16 KB – 4 MB); latency (us) vs. message size (bytes); MVAPICH2-1.9b vs. MVAPICH2-1.9b-GDR-Hybrid]
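    The numbers on this and the following two slides come from GPU-to-GPU MPI microbenchmarks in the style of the OSU micro-benchmark suite. A minimal ping-pong sketch of the latency measurement idea, assuming a CUDA-aware MPI; the message size, iteration count, and lack of warm-up iterations are illustrative simplifications.

      #include <mpi.h>
      #include <cuda_runtime.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          const int len = 4096, iters = 1000;         /* illustrative size and repetitions */
          char *d_buf;
          cudaMalloc((void **)&d_buf, len);

          MPI_Barrier(MPI_COMM_WORLD);
          double t0 = MPI_Wtime();
          for (int i = 0; i < iters; i++) {
              if (rank == 0) {                        /* rank 0 sends, then waits for the echo */
                  MPI_Send(d_buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                  MPI_Recv(d_buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              } else if (rank == 1) {
                  MPI_Recv(d_buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                  MPI_Send(d_buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
              }
          }
          double one_way_us = (MPI_Wtime() - t0) * 1e6 / (2.0 * iters);
          if (rank == 0)
              printf("%d bytes: %.2f us\n", len, one_way_us);

          cudaFree(d_buf);
          MPI_Finalize();
          return 0;
      }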

  • DK-OSU-MVAPICH2-GPU-Direct-RDMA 13

    Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

    GPU-GPU Internode MPI Uni-directional Bandwidth

    Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) nodes with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

    [Charts: Small Message Bandwidth (1 B – 4 KB) and Large Message Bandwidth (16 KB – 4 MB); bandwidth (MB/s) vs. message size (bytes); MVAPICH2-1.9b vs. MVAPICH2-1.9b-GDR-Hybrid]

  • DK-OSU-MVAPICH2-GPU-Direct-RDMA 14

    Preliminary Performance of MVAPICH2 with GPU-Direct-RDMA

    GPU-GPU Internode MPI Bi-directional Bandwidth

    Based on MVAPICH2-1.9b; Intel Sandy Bridge (E5-2670) nodes with 16 cores; NVIDIA Tesla K20c GPU; Mellanox ConnectX-3 FDR HCA; CUDA 5.0; OFED 1.5.4.1 with GPU-Direct-RDMA patch

    [Charts: Small Message Bi-Bandwidth (1 B – 4 KB) and Large Message Bi-Bandwidth (16 KB – 4 MB); bandwidth (MB/s) vs. message size (bytes); MVAPICH2-1.9b vs. MVAPICH2-1.9b-GDR-Hybrid]

  • © 2013 Mellanox Technologies (slide 15)

    Remote GPU Access through rCUDA

    rCUDA provides remote access from every node to any GPU in the system, turning dedicated GPU servers into "GPU as a Service"

    [Diagram: on the client side, the application runs against the rCUDA library, which forwards CUDA calls over the network interface; on the server side, the rCUDA daemon on the GPU server receives the calls and executes them through the local CUDA driver and runtime]

  • © 2013 Mellanox Technologies (slide 16)

    GPU as a Service

    GPUs as a network-resident service
    • Little to no overhead when using FDR InfiniBand

    Virtualize and decouple GPU services from CPU services
    • A new paradigm in cluster flexibility
    • Lower cost, lower power, and ease of use with shared GPU resources
    • Removes the difficult physical requirements of the GPU from standard compute servers

    [Diagram: "GPUs in every server" (one GPU paired with each CPU node) versus "GPUs as a Service" (CPU nodes with virtual GPUs drawing on a shared pool of GPUs)]

  • © 2013 Mellanox Technologies (slide 17)

    Thank You