Cisco usNIC: how it works, how it is used in Open MPI
Transcript of Cisco usNIC: how it works, how it is used in Open MPI
© 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public
Cisco Userspace NIC (usNIC)
Jeff Squyres Cisco Systems, Inc. November 7, 2013
Yes, we sell servers now
• Cisco UCS servers: record-setting Intel Ivy Bridge 1U and 2U servers
• Cisco 2 x 10Gb VIC: ultra low latency Ethernet (yes, really!)
• Cisco 10/40Gb Nexus switches: 40Gb top-of-rack and core switches
Cisco UCS: Many Server Form Factors, One System

Rack:
• UCS C220 M3: ideal for HPC compute-intensive applications (2 socket)
• UCS C240 M3: perfect as HPC cluster head nodes or IO nodes (2 socket)
• UCS C420 M3: 4-socket rack server for large-memory compute workloads (4 socket + giant memory HPC performance)

Blade:
• UCS B200 M3: blade form factor, 2-socket
• UCS B420 M3: 4-socket blade for large-memory compute workloads

Industry-leading compute without compromise
Market Appetite for Innovation Fuels UCS Growth: UCS #2 and climbing

• Worldwide x86 server blade market share: UCS impacting growth of established vendors like HP
• Legacy offerings flat-lining or in decline; Cisco growth outpacing the market
• Customers have shifted 19.3% of the global x86 blade server market to Cisco, and over 26% in the Americas (Source: IDC Worldwide Quarterly Server Tracker, Q1 2013 Revenue Share, May 2013)
• Demand for data center innovation has vaulted Cisco Unified Computing System (UCS) to the #2 leader in the fast-growing segment of the x86 server market
• Best CPU performance: 16 world records
• Best virtualization & cloud performance: 8 world records
• Best database performance: 9 world records
• Best enterprise application performance: 18 world records
• Best enterprise middleware performance: 14 world records
• Best HPC performance: 15 world records
One wire to rule them all:
• Commodity traffic (e.g., ssh)
• Cluster / hardware management
• File system / IO traffic
• MPI traffic

10G or 40G with real QoS
Low latency, high density 10/40Gb switches
Cisco Nexus: years of experience rolled into dependable solutions
• Nexus 3548: 190ns port-to-port latency (L2 and L3); created for HPC / HFT; 48 10Gb / 12 40Gb ports
• Nexus 6004: 1μs port-to-port latency; 384 10Gb / 96 40Gb ports
[Diagram: two-tier spine-leaf fabric]

Characteristics:
• 3 hops
• Low oversubscription (non-blocking)
• < ~3.5 usecs depending on config and workload
• 10G or 40G capable
• Spine: 4 to 16 wide
• Leaf: determined by spine density

Fabric        Spine - Leaf   Port scale     Oversub.  Latency      Forwarding    Spines  Leafs
10G Fabric    6004 - 6001    18,432 x 10G   3:1       ~3 usecs     cut-through   16      384
40G Fabric    6004 - 6004     7,680 x 40G   5:1       ~3 usecs     cut-through   16       96
Mixed Fabric  6004 - 6001     4,680 x 10G   3:1       ~3 usecs     store & fwd    4       96
10G Fabric    6004 - 3548    12,288 x 10G   3:1       ~1.5 usecs   cut-through   16      384
40G Fabric    6004 - 3548     1,152 x 40G   1:1       ~1.5 usecs   cut-through    6       96
Mixed Fabric  6004 - 3548     3,072 x 10G   3:1       ~1.5 usecs   store & fwd    4       96

…many other configurations are also possible
[Diagram: three-tier Spine2 - Spine1 - Leaf fabric]

Characteristics:
• 3 hops within a pod; 5 hops for DC east-west traffic
• Low oversubscription (non-blocking)
• < ~3.5 usecs depending on config and workload
• 10G or 40G capable
• Two spine layers

Fabric        Spine2 - Spine1 - Leaf   Port scale     Oversub.  Latency          Forwarding    Spine2  Spine1  Leafs
10G Fabric    6004 - 6004 - 6001       55,296 x 10G   3:1       ~3-5 usecs       cut-through   48      16 x 6  192
40G Fabric    6004 - 6004 - 6004       23,040 x 40G   5:1       ~3-5 usecs       cut-through   48      16       48
Mixed Fabric  6004 - 6004 - 6001       18,432 x 10G   3:1       ~3-5 usecs       store & fwd   32      4 x 8    48
10G Fabric    6004 - 6004 - 3548       24,576 x 10G   2:1       ~1.5-3.5 usecs   cut-through   32      16 x 4  192
40G Fabric    6004 - 6004 - 3548        2,304 x 40G   1:1       ~1.5-3.5 usecs   cut-through   24      6 x 8    48
Mixed Fabric  6004 - 6004 - 3548        9,216 x 10G   2:1       ~1.5-3.5 usecs   store & fwd   24      6 x 8    48
• Direct access to NIC hardware from Linux userspace: operating system bypass via the Linux Verbs API (UD)
• Utilizes the Cisco Virtual Interface Card (VIC) for ultra-low Ethernet latency: 2nd-generation 80Gbps Cisco ASIC; 2 x 10Gbps Ethernet ports; 2 x 40Gbps coming …soon…; PCI and mezzanine form factors
• Half-round-trip (HRT) ping-pong latencies (Intel E5-2690 v2 servers): raw back-to-back: 1.57μs; MPI back-to-back: 1.85μs; through MPI + Nexus 3548: 2.05μs
• These numbers keep going down
[Diagram: two paths from application to Cisco VIC hardware. TCP/IP path: application → userspace sockets library → kernel TCP stack → general Ethernet driver → VIC. usNIC path: application → userspace verbs library → VIC directly for the send/receive fast path; the Verbs IB core and the Cisco usNIC kernel driver are involved only in bootstrapping and setup]
[Diagram: MPI sits directly on the userspace verbs library. MPI directly injects L2 frames to the network and receives L2 frames directly from the VIC hardware]
[Diagram: the VIC is an SR-IOV NIC. Each MPI process owns queue pairs (QPs); outbound L2 frames flow from the QPs through the x86 chipset (VT-d I/O MMU) to the VIC, and a classifier on the VIC steers inbound L2 frames to the correct QP]
[Diagram: the VIC exposes one physical function (PF) per physical port, each with its own MAC address (aa:bb:cc:dd:ee:ff and aa:bb:cc:dd:ee:fe in the example). Each PF fronts many virtual functions (VFs), and each VF hosts one or more QPs]
[Diagram: each MPI process's QPs map onto VFs behind a PF and physical port on the VIC; the Intel IO MMU sits between the processes and the VIC]
• Used for physical ↔ virtual memory translation
• The usnic verbs driver programs (and deprograms) the IOMMU

[Diagram: the Intel IO MMU translates the userspace process's virtual addresses to physical RAM addresses on behalf of the VIC]
• For the purposes of this talk, let's assume that each physical port has one Linux ethX device
• Each ethX device corresponds to a PF
• Each usnic_Y device corresponds to an ethX device

[Diagram: a VIC with two (fiber) physical ports: port 0 ↔ eth4 / usnic_0, port 1 ↔ eth5 / usnic_1]
[hwloc lstopo diagram of the test node: Intel Xeon E5-2690 ("Sandy Bridge"), 2 sockets, 8 cores and 64GB per socket (128GB total); each core has 32KB L1d/L1i and 256KB L2, each socket a 20MB L3, 2 hardware threads per core. NUMA node 0 hosts four Intel 1GbE ports (eth0-eth3) and two Cisco VIC ports (eth4/usnic_0, eth5/usnic_1); NUMA node 1 hosts the disk (sda) and two more VIC ports (eth6/usnic_2, eth7/usnic_3)]
[The same lstopo topology diagram, repeated on this slide]
[Diagram: Open MPI software stack, top to bottom: Application → Open MPI layer (OMPI) → Point-to-point messaging layer (PML) → Byte Transfer Layer (BTL) → Operating System → Hardware]
[Diagram: MPI_Send / MPI_Recv (etc.) → OB1 PML → four usnic BTL modules (/dev/usnic_0 and /dev/usnic_1 on VIC 0; /dev/usnic_2 and /dev/usnic_3 on VIC 1)]
• Byte Transfer Layer: point-to-point transfer plugins in the OMPI layer
• No protocol is assumed / required
• The "usnic" BTL:
  • Uses unreliable datagram (UD) verbs
  • Handles all fragmentation and re-assembly (vs. the PML)
  • Retransmissions and ACKs handled in software
  • Sliding-window retransmission scheme
  • Direct inject / direct receive of L2 Ethernet frames
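The sliding-window scheme above can be sketched as sender-side bookkeeping. This is a minimal illustration, not the actual BTL's internals: the window size, struct, and function names are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define WINDOW_SIZE 4096  /* assumed window size, not the BTL's real value */

/* Per-peer sender state: fragments in (ack_seq, send_seq] are in flight.
 * Unsigned wraparound arithmetic keeps comparisons correct when the
 * sequence numbers roll over. */
typedef struct {
    uint32_t send_seq; /* highest sequence number sent so far */
    uint32_t ack_seq;  /* highest sequence number cumulatively ACKed */
} usnic_sender_t;

/* May we post another fragment, or must we stall until an ACK arrives? */
static bool window_has_room(const usnic_sender_t *s) {
    return (uint32_t)(s->send_seq - s->ack_seq) < WINDOW_SIZE;
}

/* A cumulative ACK for acked_seq retires every in-flight fragment up to
 * and including that sequence number; stale/duplicate ACKs are ignored. */
static void handle_ack(usnic_sender_t *s, uint32_t acked_seq) {
    if ((int32_t)(acked_seq - s->ack_seq) > 0)
        s->ack_seq = acked_seq;
}
```

On timeout, anything still in (ack_seq, send_seq] is retransmitted; the window bound keeps the amount of state (and receive buffering) per peer fixed.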
• One BTL module for each usNIC verbs device
• Each module has two UD queue pairs:
  • Priority queue for small and control packets
  • Data queue for up-to-MTU-sized data packets
• Each QP has its own CQ
• QPs may or may not be on the same VF
• Overall BTL glue polls the CQs for each device: first the priority CQs, then the data CQs

[Diagram: per-module Priority QP and Data QP, each with its own CQ]
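The polling order described above (every device's priority CQ before any data CQ) might look like this sketch. The types are stand-ins, and the simulated poll_cq() takes the place of a real ibv_poll_cq() call so the drain order is observable.

```c
#include <stddef.h>

/* Hypothetical per-module handle: each usnic BTL module owns a priority
 * CQ and a data CQ, identified here by small integer ids. */
typedef struct {
    int priority_cq;
    int data_cq;
} btl_module_t;

/* Record of the order CQs were polled in. A real implementation would
 * call ibv_poll_cq() here and process the completions it drains. */
static int poll_order[16];
static int poll_count;

static int poll_cq(int cq_id) {
    poll_order[poll_count++] = cq_id;
    return 0; /* number of completions drained (simulated: none) */
}

/* One pass of the BTL progress function: drain every module's priority
 * CQ before touching any data CQ, so control/small packets win. */
static void btl_progress(btl_module_t *mods, size_t n) {
    for (size_t i = 0; i < n; i++)
        poll_cq(mods[i].priority_cq);
    for (size_t i = 0; i < n; i++)
        poll_cq(mods[i].data_cq);
}
```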
• "Raw" latency (no MPI, no verbs) is 1.57μs
• MPI latency back-to-back on Sandy Bridge is 1.85μs
• Verbs is responsible for about 80ns of the 280ns difference (not related to the MPI API)
• All the rest of OMPI adds only about 200ns

[Diagram: 1.85μs MPI latency = 1.57μs raw + 80ns verbs + 200ns OMPI]
• Deferred and piggybacked ACKs

[Timeline diagram, processes A and B:
• Immediate: each message N is answered with its own ACK N
• Deferred: three messages arrive before a single cumulative ACK N+2 is sent
• Deferred + piggybacked: the cumulative ACK N+2 rides along on a message going the other way]
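One way to express the deferral/piggyback decision in code. The threshold, struct, and function names here are invented for illustration; the real BTL's policy may differ.

```c
#include <stdbool.h>
#include <stdint.h>

#define ACK_DEFER_LIMIT 3 /* assumed: never let more than 3 messages go unACKed */

/* Receiver-side ACK state for one peer. */
typedef struct {
    uint32_t last_received; /* highest in-order sequence received */
    uint32_t last_acked;    /* highest sequence we have ACKed back */
} ack_state_t;

/* Note an in-order arrival; out-of-order handling omitted for brevity. */
static void note_receive(ack_state_t *a, uint32_t seq) {
    if (seq == a->last_received + 1)
        a->last_received = seq;
}

/* When a message happens to be flowing the other way, the cumulative
 * ACK rides along in its header for free. */
static uint32_t piggyback_ack(ack_state_t *a) {
    a->last_acked = a->last_received;
    return a->last_acked;
}

/* Only if too many arrivals have gone unACKed do we pay for a
 * standalone ACK packet. */
static bool standalone_ack_due(const ack_state_t *a) {
    return (a->last_received - a->last_acked) >= ACK_DEFER_LIMIT;
}
```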
• Host writes the WQ descriptor, then writes the WQ index to the VIC via PIO
• VIC reads the WQ descriptor from RAM (only now does it have the buffer address)
• VIC reads the packet buffer from RAM
• VIC sends the buffer on the wire

[Diagram: Host/VIC timeline: write WQ descriptor → write WQ index → VIC reads WQ → VIC reads packet → send on wire]
• Host writes the WQ descriptor, then writes the index plus an encoded buffer address to the VIC via PIO
• VIC reads the WQ descriptor and, since it already has the buffer address, reads the buffer from RAM in parallel
• VIC sends the buffer on the wire ~400ns sooner

[Diagram: Host/VIC timeline: write WQ descriptor → write WQ index + addr → VIC reads WQ and packet concurrently → send on wire, ~400ns sooner]
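The optimization amounts to packing the buffer address into the same PIO doorbell write as the WQ index, so the VIC can start the payload DMA without waiting for the descriptor fetch. The actual VIC doorbell format is not public; the bit layout below is invented purely to illustrate the idea.

```c
#include <stdint.h>

/* Hypothetical doorbell layout: low 16 bits carry the WQ index, the
 * upper bits carry a 16-byte-aligned buffer address shifted down. */
static uint64_t encode_doorbell(uint16_t wq_index, uint64_t buf_addr) {
    return ((buf_addr >> 4) << 16) | wq_index;
}

/* VIC-side decoding of the same 64-bit posted write. */
static uint16_t doorbell_wq_index(uint64_t db) {
    return (uint16_t)(db & 0xFFFFu);
}

static uint64_t doorbell_buf_addr(uint64_t db) {
    return (db >> 16) << 4;
}
```

The design point is that one posted PIO write carries everything the VIC needs to begin reading the packet, turning two dependent RAM reads into concurrent ones.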
• Minimize the length of the priority receive queue: using 2048 different receive buffers is 200ns worse than using 64
• Result of an IOMMU cache effect
• We scale the length of the priority RQ with the number of processes in the job

[Diagram: IOMMU translation table between the userspace process and the VIC: register only a small slice of it ("use this much") instead of a large one ("instead of this much")]
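A plausible sketch of scaling the priority RQ length with job size while respecting the 64-vs-2048 observation above. The per-peer factor and the exact bounds are assumptions, not the BTL's real constants.

```c
#include <stdint.h>

/* Round up to a power of two (QP ring sizes are typically powers of two). */
static uint32_t next_pow2(uint32_t v) {
    v--;
    v |= v >> 1;  v |= v >> 2;  v |= v >> 4;
    v |= v >> 8;  v |= v >> 16;
    return v + 1;
}

/* Size the priority receive queue to the job: a few buffers per peer,
 * floored at 64 (the fast case measured above) and capped at 2048
 * (which measured ~200ns worse due to IOMMU cache pressure). */
static uint32_t priority_rq_len(uint32_t nprocs) {
    uint32_t len = next_pow2(4 * nprocs); /* assumed: ~4 buffers per peer */
    if (len < 64)   len = 64;
    if (len > 2048) len = 2048;
    return len;
}
```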
• Use fastpaths wherever possible
• Be friendly to the optimizer and instruction cache
• Made a noticeable difference (!)

    if (fastpathable)
        do_it_inline();
    else
        call_slower_path();
[Same lstopo topology diagram. MPI processes running on these cores…]
[Same lstopo topology diagram. MPI processes running on these cores… Only use these usNIC devices for short messages]
[Same lstopo topology diagram. MPI processes running on these cores… Use ALL usNIC devices for long messages]
• Everything above the firmware is open source
• Open MPI: distributing Cisco Open MPI 1.6.5; upstream in Open MPI 1.7.3
• Libibverbs plugin
• Verbs kernel module
Hardware
• Cisco UCS C220 M3 rack server
  • Intel E5-2690 processor, 2.9 GHz (3.3 GHz Turbo), 2 sockets, 8 cores/socket
  • 1600 MHz DDR3 memory, 8 GB x 16, 128 GB installed
  • Cisco VIC 1225 with ultra-low-latency networking usNIC driver
• Cisco Nexus 3548
  • 48-port 10 Gbps ultra-low-latency Ethernet networking switch

Software
• OS: CentOS 6.4, kernel: 2.6.32-358.el6.x86_64 (SMP)
• NetPIPE (ver 3.7.1)
• Intel MPI Benchmarks (ver 3.2.4)
• High Performance Linpack (ver 2.1)
• Other: Intel C Compiler (ver 13.0.1), Open MPI (ver 1.6.5), Cisco usNIC (1.0.0.7x)
[Chart: NetPIPE latency (usecs) and throughput (Mbps) vs. message size, 1 byte to 8MB. Cisco usNIC: 2.05 usecs latency for small messages; 9.3 Gbps throughput]
[Chart: Intel MPI Benchmarks PingPong and PingPing latency (usecs) and throughput (MB/s) vs. message size, 4 bytes to 4MB. 2.05 usecs PingPong latency; 2.10 usecs PingPing latency. PingPing and PingPong latency track together!]
[Chart: IMB SendRecv and Exchange latency (usecs) and throughput (MB/s) vs. message size, 4 bytes to 4MB. 2.11 usecs SendRecv latency; 2.58 usecs Exchange latency. Full bi-directional performance for both Exchange and SendRecv]
CPU cores:  16      32      64       128      256      512
GFlops:     340.51  673.68  1271.14  2647.09  5258.27  9773.45

GFLOPS = FLOPS/cycle x number of CPU cores x frequency (GHz)
E5-2690 max GFLOPS = 8 x 16 x 3.3 = 422 GFLOPS

Single-node HPL score (16 cores): 340.51 GFLOPS*
32-node HPL score (512 cores): 9,773.45 GFLOPS
Efficiency relative to the single-node score: 9,773.45 / (340.51 x 32) x 100 = 89.69%

* Score may improve with additional compiler settings or newer compiler versions
Thank you.