Interconnect Your Future
Paving the Road to Exascale
September 2016, HPC Advisory Council Spain Conference
© 2016 Mellanox Technologies 2
Mellanox Connects the World’s Fastest Supercomputer
93 Petaflop performance, 3X higher versus #2 on the TOP500
40K nodes, 10 million cores, 256 cores per CPU
Mellanox adapter and switch solutions
National Supercomputing Center in Wuxi, China
#1 on the TOP500 Supercomputing List
The TOP500 list has evolved and now includes HPC and Cloud / Web 2.0 hyperscale systems
Mellanox connects 41.2% of overall TOP500 systems
Mellanox connects 70.4% of the TOP500 HPC platforms
Mellanox connects 46 Petascale systems, nearly 50% of the total Petascale systems
InfiniBand is the Interconnect of Choice for HPC Compute and Storage Infrastructures
© 2016 Mellanox Technologies 3
Accelerating HPC Leadership Systems
“Summit” System “Sierra” System
Proud to Pave the Path to Exascale
© 2016 Mellanox Technologies 4
The Ever Growing Demand for Higher Performance
[Performance development roadmap: 2000–2020 timeline from Terascale through Petascale ("Roadrunner", the 1st Petascale system) toward Exascale; architecture shifts from SMP to clusters and from single-core to many-core]
[Co-Design diagram: Hardware (HW), Software (SW) and Application (APP) designed together]
The Interconnect is the Enabling Technology
© 2016 Mellanox Technologies 5
The Intelligent Interconnect to Enable Exascale Performance
CPU-Centric: must wait for the data, which creates performance bottlenecks; limited to main CPU usage, which results in performance limitations
Co-Design: works on the data as it moves, enabling performance and scale; creates synergies that enable higher performance and scale
© 2016 Mellanox Technologies 6
Breaking the Application Latency Wall
Today: Network device latencies are on the order of 100 nanoseconds
Challenge: Enabling the next order of magnitude improvement in application performance
Solution: Creating synergies between software and hardware – intelligent interconnect
Intelligent Interconnect Paves the Road to Exascale Performance
10 years ago: Network ~10 microseconds, Communication Framework ~100 microseconds
Today: Network ~0.1 microsecond, Communication Framework ~10 microseconds
Future: Co-Design Network ~0.05 microsecond, Communication Framework ~1 microsecond
© 2016 Mellanox Technologies 7
Mellanox Smart Interconnect Solutions: Switch-IB 2 and ConnectX-5
© 2016 Mellanox Technologies 8
Switch-IB 2 and ConnectX-5 Smart Interconnect Solutions
Delivering 10X Performance Improvement for MPI and SHMEM/PGAS Communications
Barrier, Reduce, All-Reduce, Broadcast
Sum, Min, Max, Min-loc, max-loc, OR, XOR, AND
Integer and Floating-Point, 32 / 64 bit
SHArP Enables Switch-IB 2 to Execute Data Aggregation / Reduction Operations in the Network (see the MPI sketch after this feature list)
PCIe Gen3 and Gen4
Integrated PCIe Switch
Advanced Dynamic Routing
MPI Collectives in Hardware
MPI Tag Matching in Hardware
In-Network Memory
100Gb/s Throughput
0.6usec Latency (end-to-end)
200M Messages per Second
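To make the offloaded collectives concrete, below is a minimal sketch of the MPI Allreduce pattern that SHArP can execute in the switch. It is illustrative only; whether the reduction actually runs in-network depends on the MPI stack (e.g. HPC-X) and fabric configuration, not on the application code.

```c
/* Minimal sketch (illustrative only) of the MPI Allreduce pattern that SHArP
 * can offload to Switch-IB 2: every rank contributes a vector and receives
 * the element-wise sum. Whether the reduction actually runs in the network
 * depends on the MPI stack and fabric configuration, not on this code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One of the supported operation/data-type combinations listed above:
     * SUM over 64-bit floating point. */
    double local[4]  = { rank + 1.0, rank + 2.0, rank + 3.0, rank + 4.0 };
    double global[4];

    MPI_Allreduce(local, global, 4, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global[0] = %.1f\n", global[0]);

    MPI_Finalize();
    return 0;
}
```

Run under any MPI launcher (e.g. `mpirun -np 4 ./allreduce`); the same source benefits from in-network aggregation when the library and switch support it.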
© 2016 Mellanox Technologies 9
Highest-Performance 100Gb/s Interconnect Solutions
Adapters: 100Gb/s adapter, 0.6us latency, 200 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)
InfiniBand switch: 36 EDR (100Gb/s) ports, <90ns latency, throughput of 7.2Tb/s, 7.02 billion msg/sec (195M msg/sec/port)
Ethernet switch: 32 100GbE ports, 64 25/50GbE ports (10 / 25 / 40 / 50 / 100GbE), throughput of 6.4Tb/s
Cables and transceivers: active optical and copper cables (10 / 25 / 40 / 50 / 56 / 100Gb/s), VCSELs, silicon photonics and copper
Software: MPI, SHMEM/PGAS, UPC for commercial and open source applications; leverages hardware accelerations
© 2016 Mellanox Technologies 10
The Performance Advantage of EDR 100G InfiniBand (28-80%)
© 2016 Mellanox Technologies 11
Switch-IB 2 SHArP Performance Advantage
MiniFE is a finite element mini-application that implements kernels representative of implicit finite-element applications
10X to 25X Performance Improvement for the Allreduce MPI Collective
© 2016 Mellanox Technologies 12
SHArP Performance Advantage with Intel Xeon Phi Knight Landing
Maximizing KNL Performance – 50% Reduction in Run Time
(Customer Results)
[Charts: run time, lower is better]
OSU - OSU MPI benchmark; IMB – Intel MPI Benchmark
© 2016 Mellanox Technologies 13
HPC-X with SHArP Technology
HPC-X with SHArP Delivers 2.2X Higher Performance over Intel MPI
OpenFOAM is a popular computational fluid dynamics application
© 2016 Mellanox Technologies 14
Leading Supplier of End-to-End Interconnect Solutions
Store, Analyze – Enabling the Use of Data
ICs, Adapter Cards, Switches/Gateways, Cables/Modules, Software
Comprehensive End-to-End InfiniBand and Ethernet Portfolio (VPI)
Metro / WAN, NPU & Multicore (NPS, TILE)
© 2016 Mellanox Technologies 15
BlueField System-on-a-Chip (SoC) Solution
Integration of ConnectX-5 + Multicore ARM
State-of-the-art capabilities
• 10 / 25 / 40 / 50 / 100G Ethernet & InfiniBand
• PCIe Gen3 / Gen4
• Hardware acceleration offloads – RDMA, RoCE, NVMeF, RAID
Family of products
• Range of ARM core counts and I/O ports/speeds
• Price/performance points
NVMe Flash Storage Arrays
Scale-Out Storage (NVMe over Fabric)
Accelerating & Virtualizing VNFs
Open vSwitch (OVS), SDN
Overlay networking offloads
© 2016 Mellanox Technologies 16
InfiniBand Router Solutions
SB7780 Router 1U
Supports up to 6 Different Subnets
Isolation Between Different InfiniBand Networks (Each Network can be Managed Separately)
Native InfiniBand Connectivity Between Different Network Topologies (Fat-Tree, Torus, Dragonfly, etc.)
© 2016 Mellanox Technologies 17
Multi-Host Socket Direct – Low Latency Socket Communication
Each CPU with direct network access
QPI avoidance for I/O – improves performance
Enables GPU / peer direct on both sockets
Solution is transparent to software
Multi-Host Socket Direct Performance
50% Lower CPU Utilization
20% Lower Latency
Multi Host Evaluation Kit
Lower Application Latency, Free-up CPU
© 2016 Mellanox Technologies 18
Technology Roadmap – One-Generation Lead over the Competition
[Technology roadmap: 2000–2020, from Terascale through Petascale toward Exascale; link speeds 20G, 40G, 56G, 100G, 200G, with Mellanox 400G ahead; the 3rd-ranked TOP500 system in 2003 (Virginia Tech, Apple) and "Roadrunner", the 1st Petascale system, were Mellanox Connected]
© 2016 Mellanox Technologies 19
Maximize Performance via Accelerator and GPU Offloads
GPUDirect RDMA Technology
© 2016 Mellanox Technologies 20
GPUs are Everywhere!
© 2016 Mellanox Technologies 21
Performance of MPI with GPUDirect RDMA (source: Prof. DK Panda)
GPU-GPU internode MPI latency: 88% lower latency (2.18 usec); lower is better
GPU-GPU internode MPI bandwidth: ~10X (9.3X measured) increase in throughput; higher is better
© 2016 Mellanox Technologies 22
Mellanox GPUDirect RDMA Performance Advantage
HOOMD-blue is a general-purpose Molecular Dynamics simulation code accelerated on GPUs
GPUDirect RDMA allows direct peer-to-peer GPU communications over InfiniBand (see the sketch at the end of this slide)
• Unlocks performance between GPU and InfiniBand
• Provides a significant decrease in GPU-GPU communication latency
• Provides complete CPU offload from all GPU communications across the network
2X Application Performance (102% improvement)!
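A minimal sketch of the communication pattern GPUDirect RDMA accelerates, assuming a CUDA-aware MPI build (e.g. Open MPI with CUDA support or MVAPICH2-GDR): the application simply passes device pointers to MPI, and the GPUDirect data path is provided by the MPI library and drivers, not by application code.

```c
/* Hypothetical sketch: exchanging GPU-resident buffers with a CUDA-aware MPI.
 * With GPUDirect RDMA the HCA reads/writes GPU memory directly, so no
 * cudaMemcpy staging through host memory is required. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                    /* 1M doubles per message */
    double *d_send, *d_recv;
    cudaMalloc((void **)&d_send, n * sizeof(double));
    cudaMalloc((void **)&d_recv, n * sizeof(double));
    cudaMemset(d_send, 0, n * sizeof(double));

    int peer = rank ^ 1;                      /* pair ranks: 0<->1, 2<->3, ... */
    if (peer < size) {
        /* Device pointers are passed straight to MPI; a CUDA-aware MPI with
         * GPUDirect RDMA moves the data GPU-to-GPU across InfiniBand. */
        MPI_Sendrecv(d_send, n, MPI_DOUBLE, peer, 0,
                     d_recv, n, MPI_DOUBLE, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```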
© 2016 Mellanox Technologies 23
GPUDirect Sync (GPUDirect 4.0)
GPUDirect RDMA (3.0) – direct data path between the GPU and Mellanox interconnect
• Control path still uses the CPU
- CPU prepares and queues communication tasks on GPU
- GPU triggers communication on HCA
- Mellanox HCA directly accesses GPU memory
GPUDirect Sync (GPUDirect 4.0)
• Both data path and control path go directly between the GPU and the Mellanox interconnect
[2D stencil benchmark: average time per iteration (us) vs. number of nodes/GPUs (2 and 4); RDMA+PeerSync is 27% and 23% faster than RDMA only; lower is better]
Maximum Performance for GPU Clusters
© 2016 Mellanox Technologies 24
NVIDIA DGX-1 Deep Learning Appliance
Deep Learning Supercomputer in a Box
8 x Pascal GPU (P100)
5.3TFlops
16nm FinFET
NVLINK
4 x ConnectX-4 EDR 100G InfiniBand Adapters
© 2016 Mellanox Technologies 25
Interconnect Architecture Comparison
Offload versus Onload (Non-Offload)
© 2016 Mellanox Technologies 26
Offload versus Onload (Non-Offload)
Two interconnect architectures exist – Offload-based and Onload-based
Offload Architecture
• The Interconnect manages and executes all network operations
• The interconnect is capable of including application acceleration engines
• Offloads the CPU and therefore frees CPU cycles to be used by the applications (see the overlap sketch after this list)
• Development requires large R&D investment
• Higher data center ROI
Onload architecture
• A CPU-centric approach – everything must be executed on and by the CPU
• The CPU is responsible for all network functions, the interconnect only pushes the data into the wire
• Cannot support acceleration engines, no support for RDMA, and network transport is done by the CPU
• Onloads the CPU, reducing the CPU cycles available to the applications
• Does not require R&D investments or interconnect expertise
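As a concrete illustration of the offload benefit noted above, here is a hypothetical sketch of communication/computation overlap: the interconnect progresses non-blocking transfers while the CPU keeps computing. This is generic MPI code, not tied to any particular vendor stack; the buffer sizes and the `compute` function are placeholders for real application work.

```c
/* Hypothetical sketch of communication/computation overlap, the practical
 * payoff of an offload architecture: the HCA progresses the non-blocking
 * transfers while the CPU keeps computing. */
#include <mpi.h>
#include <stdlib.h>

static void compute(double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)         /* stand-in for application work */
        x[i] = x[i] * 1.000001 + 0.5;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;
    double *sendbuf = calloc(n, sizeof(double));
    double *recvbuf = calloc(n, sizeof(double));
    double *work    = calloc(n, sizeof(double));

    int peer = rank ^ 1;                    /* pair ranks: 0<->1, 2<->3, ... */
    MPI_Request req[2];
    int nreq = 0;

    if (peer < size) {
        /* Post non-blocking transfers; with RDMA offload the interconnect
         * moves the data without consuming CPU cycles. */
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[nreq++]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[nreq++]);
    }

    compute(work, n);                       /* CPU works while data is in flight */

    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}
```

On an onload architecture, the same code makes progress on the transfers only when the CPU enters the MPI library, so the overlap largely disappears.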
© 2016 Mellanox Technologies 27
Sandia National Laboratory Paper – Offloading versus Onloading
© 2016 Mellanox Technologies 28
Application Performance Comparison – LS-DYNA
InfiniBand Delivers 42-63% Higher Performance With Only 12 Nodes
LS-DYNA is structural and fluid analysis software used for automotive, aerospace, manufacturing simulations, and more
[Charts compare InfiniBand against a proprietary interconnect; higher is better]
© 2016 Mellanox Technologies 29
Application Performance Comparison – SPEC MPI Benchmark Suite
The SPEC MPI benchmark suite evaluates MPI-parallel, floating-point, compute-intensive performance across a wide range of applications using the Message-Passing Interface (MPI)
InfiniBand Delivers Superior Performance and Scaling
The Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers
[Chart compares InfiniBand against a proprietary interconnect]
© 2016 Mellanox Technologies 30
Offload versus Onload – CPU Overhead
Operation: 100Gb/s data throughput (send-receive)
• InfiniBand: 0.8% CPU utilization, CPU at 59% of nominal frequency during the operation
• Proprietary onload: 59.6% CPU utilization, CPU at 100% of nominal frequency during the operation
Intel Performance Counter Monitor tool output:
• InfiniBand: 99.5 Gb/s data throughput, AFREQ 0.59 (ratio to nominal CPU frequency while in active state), 39M CPU instructions, 163M active cycles
• Proprietary: 95 Gb/s data throughput, AFREQ 1, 3725M CPU instructions, 12000M active cycles
InfiniBand Guarantees the Lowest CPU Overhead and Enables OPEX Savings!
© 2016 Mellanox Technologies 31
Mellanox InfiniBand Leadership
• Application performance: 34-62% higher
• Latency: 20% lower
• Message rate: 68% higher
• Cost/performance (CPAR – $/performance): 20-35% lower
100Gb/s link speed (2014) – Gain Competitive Advantage Today
200Gb/s link speed (2017) – Protect Your Future
Smart Network for Smart Systems – RDMA, Acceleration Engines, Programmability
Higher Performance
Unlimited Scalability
Higher Resiliency
Proven!
Thank You