Satinder Nijjar, 2018
High speed networking Configuration and testing
InfiniBand Setup and Verification
Info: Checking InfiniBand
Using standard Linux commands
Check that the InfiniBand cards are visible to the OS and that the drivers are loaded
lspci, lsmod
Check which version of the software is installed
modinfo mlx5_core
Check the version of the firmware on the HCA
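The checklist above can be combined into one script. This is a sketch only, assuming a Linux host with the Mellanox mlx5 stack; the function name is mine, and each step degrades gracefully on machines without the hardware:

```shell
# ib_health_check: run the four verification steps from the checklist.
# Each step falls back to a message when the tool or hardware is absent.
ib_health_check() {
    echo "== IB cards visible to the OS =="
    if command -v lspci >/dev/null 2>&1; then
        lspci | grep -i mellanox || echo "no Mellanox devices found"
    else
        echo "lspci not available"
    fi

    echo "== IB drivers loaded =="
    if command -v lsmod >/dev/null 2>&1; then
        lsmod | grep -e '^ib_' -e '^mlx' || echo "no ib_/mlx modules loaded"
    else
        echo "lsmod not available"
    fi

    echo "== Driver software version =="
    modinfo mlx5_core 2>/dev/null | grep -i '^version' || echo "mlx5_core not installed"

    echo "== HCA firmware version =="
    cat /sys/class/infiniband/mlx5*/fw_ver 2>/dev/null || echo "no mlx5 HCAs in sysfs"
}

ib_health_check
```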
Verify ConnectX Card is Working
Run “lspci” to ensure all four IB cards are recognized by the system. The output should show all four controllers.
Run “lsmod” and verify that the InfiniBand drivers are present. The output should consist of a list of ib_ and mlx_ driver components
$ lspci | grep -i mellanox
05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
0c:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
84:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
8b:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
$ lsmod | grep -e ib_ -e mlx_
ib_ucm 20480 0
ib_ipoib 131072 0
ib_cm 45056 3 rdma_cm,ib_ucm,ib_ipoib
ib_uverbs 73728 2 ib_ucm,rdma_ucm
ib_umad 24576 0
mlx5_ib 192512 0
mlx4_ib 192512 0
ib_sa 36864 5 rdma_cm,ib_cm,mlx4_ib,rdma_ucm,ib_ipoib
ib_mad 57344 4 ib_cm,ib_sa,mlx4_ib,ib_umad
ib_core 143360 13 rdma_cm,ib_cm,ib_sa,iw_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_mad,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
ib_addr 20480 3 rdma_cm,ib_core,rdma_ucm
ib_netlink 16384 3 rdma_cm,iw_cm,ib_addr
mlx4_core 344064 2 mlx4_en,mlx4_ib
mlx5_core 524288 1 mlx5_ib
mlx_compat 16384 18 rdma_cm,ib_cm,ib_sa,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,ib_mad,ib_ucm,ib_netlink,ib_addr,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
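As a quick automated check of a listing like the one above, here is a hedged sketch that scans `lsmod`-style output for a few core modules; the helper name and the particular module list are mine, chosen from the output above:

```shell
# check_ib_modules: read `lsmod`-style text on stdin and report any of the
# expected IB/Mellanox modules that are missing. Returns nonzero if any are.
check_ib_modules() {
    listing=$(cat)   # capture the lsmod output once so we can scan it repeatedly
    missing=0
    for mod in ib_core ib_uverbs ib_ipoib mlx5_core mlx5_ib; do
        if ! printf '%s\n' "$listing" | awk -v m="$mod" '$1 == m { found=1 } END { exit !found }'; then
            echo "missing: $mod"
            missing=1
        fi
    done
    return $missing
}
```

Typical use: `lsmod | check_ib_modules && echo "all expected IB modules loaded"`.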
Verify Software Release
Verify that the OFED software is correctly installed:
DGX-1 OS release 1.0 - OFED software 3.2
DGX-1 OS release 2.0 - OFED software 3.4
DGX-1 OS release 3.0 - OFED software 4.0
DGX-1 OS release 4.0 - OFED software 4.4
Restart the InfiniBand service
Restart the Service Manager service
Verify that both services have started
For further reference, check the User Guide chapter on InfiniBand card replacement:
$ modinfo mlx5_core | grep -i version | head -1
Version : 4.4-2.0.7
$ sudo service openibd restart
$ sudo service opensmd restart
$ service openibd status
$ service opensmd status
http://docs.nvidia.com/dgx/dgx1-user-guide/maintenance.html#task_setting-up-infiniband
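Per the table above, each DGX-1 OS release pairs with a specific OFED release. The comparison can be automated; this is a sketch in which the helper name is mine, and the version string is read from stdin so the check mirrors `modinfo mlx5_core` output without requiring the driver to be present:

```shell
# check_ofed_version EXPECTED: compare the first "version:" line of
# `modinfo mlx5_core`-style text on stdin against EXPECTED.
check_ofed_version() {
    expected=$1
    actual=$(grep -i '^version' | head -1 | awk '{print $NF}')
    if [ "$actual" = "$expected" ]; then
        echo "OFED driver version OK ($actual)"
    else
        echo "OFED driver version mismatch: have '$actual', want '$expected'"
        return 1
    fi
}

# Typical use on a DGX-1 running OS release 4.0:
#   modinfo mlx5_core | check_ofed_version 4.4-2.0.7
```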
Verify Firmware Release
Verify the firmware version:
DGX-1 OS release 1.0 is 12.16.1020
DGX-1 OS release 2.0 is 12.17.1010
DGX-1 OS release 3.0 is 12.18.1000
DGX-1 OS release 4.0 is 12.24.1000
If the firmware version does not match, run this script to perform a firmware update. After the reboot, repeat the firmware version check to confirm the new version.
$ cat /sys/class/infiniband/mlx5*/fw_ver
12.24.1000
12.24.1000
12.24.1000
12.24.1000
$ sudo /opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl
Attempting to perform Firmware update...
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX4
Part Number: MCX455A-ECA_Ax
Description: ConnectX-4 VPI adapter card; EDR IB (100Gb/s) and 100GbE;
single-port QSFP28; PCIe3.0 x16; ROHS R6
PSID: MT_2180110032
PCI Device Name: 05:00.0
Base GUID: 248a0703004a5368
Base MAC: 0000248a074a5368
Versions: Current Available
FW 12.16.1020 12.24.1000
PXE 3.4.0812 3.5.0603
UEFI 14.16.0017 14.17.0011
Status: Update required
…snipped…
Status: Up to date
Log File: /tmp/mlnx_fw_update.log
Please reboot your system for the changes to take effect.
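The firmware check and conditional update above can be scripted. This is a sketch; the helper name is mine, and the sysfs directory is a parameter so the logic can be exercised on any machine:

```shell
# check_fw_versions EXPECTED [SYSFS_DIR]: compare each HCA's fw_ver file
# under SYSFS_DIR (default /sys/class/infiniband) against EXPECTED and
# report mismatches. Returns nonzero if any card differs.
check_fw_versions() {
    expected=$1
    base=${2:-/sys/class/infiniband}
    bad=0
    for f in "$base"/mlx5*/fw_ver; do
        [ -r "$f" ] || continue          # skip when no mlx5 HCAs are present
        ver=$(cat "$f")
        if [ "$ver" != "$expected" ]; then
            echo "$f: $ver (expected $expected)"
            bad=1
        fi
    done
    return $bad
}

# e.g. on a DGX-1 OS release 4.0 system:
#   check_fw_versions 12.24.1000 || sudo /opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl
```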
Challenge: IB Setup & Verification
How many InfiniBand Cards are installed?
What are the PCI bus addresses of all the IB cards on your system?
What version of OFED software is present?
What is the firmware version on each card?
Hint: “ibstat -l“ will list all Mellanox Devices
Solution: IB Setup & Verification
How many InfiniBand cards are installed?
What are the PCI bus addresses of all the IB cards on your system?
Use the “lspci” command to list all PCI buses and devices in the system:
$ lspci | grep -i mellanox
05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
0c:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
84:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
8b:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Four cards are installed:
ib0: PCI 05:00.0
ib1: PCI 0c:00.0
ib2: PCI 84:00.0
ib3: PCI 8b:00.0
Solution
What version of OFED software is present?
What is the firmware version on each card?
$ cat /sys/class/infiniband/mlx5*/fw_ver
12.17.1010
12.17.1010
12.17.1010
12.17.1010
$ modinfo mlx5_core | grep -i version | head -1
Version : 4.4-2.0.7
IB/Ethernet Mode
Info: InfiniBand or Ethernet
Why IB mode?
IB is the default mode for DGX-1 clustering
For multi node training
Why Ethernet mode?
NCCL with RDMA can also be used for multi node training (RoCE)
Customer can leverage existing NAS systems
How do you change modes
Using the Mellanox Software Tools (mst), toggle LINK_TYPE_P1 on each device at “/dev/mst/mt4115_pciconf#” between 1 (InfiniBand) and 2 (Ethernet)
UPDATING LINK PROTOCOL
a) Run “lspci” to identify the current link protocol. To reset the link type, download and install the Mellanox Firmware Tools (MFT) at http://www.mellanox.com/page/management_tools
b) Start the mst driver by typing “sudo mst start”. Query the host for the Mellanox device ID MT4115. This system has four adapters, 0-3.
c) Set the link type: InfiniBand = 1, Ethernet = 2. E.g., set device 0 to link type 1.
d) Repeat to set the link type for each adapter. Reboot. Repeat step a) to verify the new settings.
$ lspci | grep -i mellanox
05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
0c:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
84:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
8b:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
OR
05:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
0c:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
84:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
8b:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
# ls -ls /dev/mst
0 crw------- 1 root root 238, 0 Mar 13 15:44 mt4115_pciconf0
0 crw------- 1 root root 238, 0 Mar 13 15:44 mt4115_pciconf1
0 crw------- 1 root root 238, 0 Mar 13 15:44 mt4115_pciconf2
0 crw------- 1 root root 238, 0 Mar 13 15:44 mt4115_pciconf3
# mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=1
Device #1:
----------
Device type: ConnectX4
Name: N/A
Description: N/A
Device: /dev/mst/mt4115_pciconf0
Configurations: Next Boot New
LINK_TYPE_P1 IB(1) IB(1)
Apply new Configuration? ? (y/n) [n] : _
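Setting the link type card by card, as above, can be wrapped in a small loop. This is a sketch assuming the four /dev/mst/mt4115_pciconf0..3 device nodes shown earlier; the DRY_RUN guard and function name are mine, so the commands can be previewed before touching firmware:

```shell
# set_link_type MODE: apply LINK_TYPE_P1=MODE (1=InfiniBand, 2=Ethernet)
# to all four adapters. With DRY_RUN=1 the mlxconfig commands are only
# printed, not executed.
set_link_type() {
    mode=$1
    for i in 0 1 2 3; do
        cmd="mlxconfig -y -d /dev/mst/mt4115_pciconf$i set LINK_TYPE_P1=$mode"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "would run: $cmd"
        else
            sudo $cmd
        fi
    done
    echo "reboot required for the new link type to take effect"
}
```

Preview switching all ports to Ethernet with `DRY_RUN=1 set_link_type 2`.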
Determine the Port Configuration
Run “ibv_devinfo | grep -e "hca_id\|state\|link_layer"” to determine the current link configuration and state. “ibstat” could also have been used to collect this information.

Cards configured for InfiniBand:
$ ibv_devinfo | grep -e "hca_id\|state\|link_layer"
hca_id: mlx5_3
    state: PORT_ACTIVE (4)
    link_layer: InfiniBand
hca_id: mlx5_2
    state: PORT_ACTIVE (4)
    link_layer: InfiniBand
hca_id: mlx5_1
    state: PORT_ACTIVE (4)
    link_layer: InfiniBand
hca_id: mlx5_0
    state: PORT_ACTIVE (4)
    link_layer: InfiniBand

Cards configured for Ethernet:
$ ibv_devinfo | grep -e "hca_id\|state\|link_layer"
hca_id: mlx5_3
    state: PORT_ACTIVE (4)
    link_layer: Ethernet
hca_id: mlx5_2
    state: PORT_ACTIVE (4)
    link_layer: Ethernet
hca_id: mlx5_1
    state: PORT_ACTIVE (4)
    link_layer: Ethernet
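Listings like the above can be condensed into a one-line summary by tallying the link_layer values. A sketch, where the helper name is mine and the input is captured `ibv_devinfo` output:

```shell
# summarize_link_layers: read `ibv_devinfo` output on stdin and print a
# count of ports per link layer, e.g. "4 InfiniBand" or "3 Ethernet".
summarize_link_layers() {
    awk '$1 == "link_layer:" { n[$2]++ }
         END { for (l in n) print n[l], l }'
}

# e.g.: ibv_devinfo | summarize_link_layers
```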
Verify Status using ibstat
$ ibstat
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.24.1000
    Hardware version: 0
    Node GUID: 0x248a0703000de288
    System image GUID: 0x248a0703000de288
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 65535
        LMC: 0
        SM lid: 0
        Capability mask: 0x2651e848
        Port GUID: 0x248a0703000de288
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.24.1000
    Hardware version: 0
    Node GUID: 0x248a0703000de26c
    System image GUID: 0x248a0703000de26c
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 100
        Base lid: 65535
        LMC: 0
        SM lid: 0
        Capability mask: 0x2651e848
        Port GUID: 0x248a0703000de26c
        Link layer: InfiniBand
CA 'mlx5_2'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.24.1000
    Hardware version: 0
    Node GUID: 0x248a0703001effde
    System image GUID: 0x248a0703001effde
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 100
        Base lid: 65535
        LMC: 0
        SM lid: 0
        Capability mask: 0x2651e848
        Port GUID: 0x248a0703001effde
        Link layer: InfiniBand
CA 'mlx5_3'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.24.1000
    Hardware version: 0
    Node GUID: 0x7cfe900300118f22
    System image GUID: 0x7cfe900300118f22
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 100
        Base lid: 65535
        LMC: 0
        SM lid: 0
        Capability mask: 0x2651e848
        Port GUID: 0x7cfe900300118f22
        Link layer: InfiniBand
Mellanox Software Tools (mst)
Start Mellanox Software Tools (mst) and verify the module loaded.

$ sudo mst status
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module is not loaded
PCI Devices:
------------
05:00.0
84:00.0
0c:00.0
8b:00.0

$ sudo mst start

$ sudo mst status
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
MST devices:
------------
/dev/mst/mt4115_pciconf0 - PCI configuration cycles access.
    domain:bus:dev.fn=0000:05:00.0 addr.reg=88 data.reg=92
    Chip revision is: 00
/dev/mst/mt4115_pciconf1 - PCI configuration cycles access.
    domain:bus:dev.fn=0000:0c:00.0 addr.reg=88 data.reg=92
    Chip revision is: 00
/dev/mst/mt4115_pciconf2 - PCI configuration cycles access.
    domain:bus:dev.fn=0000:84:00.0 addr.reg=88 data.reg=92
    Chip revision is: 00
/dev/mst/mt4115_pciconf3 - PCI configuration cycles access.
    domain:bus:dev.fn=0000:8b:00.0 addr.reg=88 data.reg=92
    Chip revision is: 00
Update the port configurations to Ethernet

Change the configuration on all four ports:
$ sudo mlxconfig -y -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=2
$ sudo mlxconfig -y -d /dev/mst/mt4115_pciconf1 set LINK_TYPE_P1=2
$ sudo mlxconfig -y -d /dev/mst/mt4115_pciconf2 set LINK_TYPE_P1=2
$ sudo mlxconfig -y -d /dev/mst/mt4115_pciconf3 set LINK_TYPE_P1=2

Verify the configuration changes were applied:
$ sudo mlxconfig query | grep -e LINK_TYPE -e "Device.*mst"
PCI device: /dev/mst/mt4115_pciconf3 LINK_TYPE_P1 ETH(2)
PCI device: /dev/mst/mt4115_pciconf2 LINK_TYPE_P1 ETH(2)
PCI device: /dev/mst/mt4115_pciconf1 LINK_TYPE_P1 ETH(2)
PCI device: /dev/mst/mt4115_pciconf0 LINK_TYPE_P1 ETH(2)

Reboot the system:
$ sudo reboot

Verify that the desired configuration is now running:
$ ibv_devinfo | grep -e "hca_id\|link_layer"
hca_id: mlx5_3    link_layer: Ethernet
hca_id: mlx5_2    link_layer: Ethernet
hca_id: mlx5_1    link_layer: Ethernet
hca_id: mlx5_0    link_layer: Ethernet
Update the port configurations to InfiniBand

Change the configuration on all four ports:
$ sudo mlxconfig -y -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=1
$ sudo mlxconfig -y -d /dev/mst/mt4115_pciconf1 set LINK_TYPE_P1=1
$ sudo mlxconfig -y -d /dev/mst/mt4115_pciconf2 set LINK_TYPE_P1=1
$ sudo mlxconfig -y -d /dev/mst/mt4115_pciconf3 set LINK_TYPE_P1=1

Verify the configuration changes were applied:
$ sudo mlxconfig query | grep -e LINK_TYPE -e "Device.*mst"
PCI device: /dev/mst/mt4115_pciconf3 LINK_TYPE_P1 IB(1)
PCI device: /dev/mst/mt4115_pciconf2 LINK_TYPE_P1 IB(1)
PCI device: /dev/mst/mt4115_pciconf1 LINK_TYPE_P1 IB(1)
PCI device: /dev/mst/mt4115_pciconf0 LINK_TYPE_P1 IB(1)

Reboot the system:
$ sudo reboot

Verify that the desired configuration is now running:
$ ibv_devinfo | grep -e "hca_id\|link_layer"
hca_id: mlx5_3    link_layer: InfiniBand
hca_id: mlx5_2    link_layer: InfiniBand
hca_id: mlx5_1    link_layer: InfiniBand
hca_id: mlx5_0    link_layer: InfiniBand
Challenge: IB/Ethernet mode switching
Determine the current port configuration
Start Mellanox Software Tools (mst)
Set one of the cards (mt4115_pciconf#) to Ethernet mode
Solution: IB/Ethernet mode switching
Determine the current port configuration of all 4 cards
$ ibv_devinfo | grep -e "hca_id\|link_layer"
hca_id: mlx5_3    link_layer: InfiniBand
hca_id: mlx5_2    link_layer: InfiniBand
hca_id: mlx5_1    link_layer: InfiniBand
hca_id: mlx5_0    link_layer: InfiniBand
Solution: IB/Ethernet mode switching
MST tools
$ sudo mst status
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module is not loaded
PCI Devices:
------------
05:00.0
84:00.0
0c:00.0
8b:00.0
$ sudo mst start
$ sudo mst status
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
MST devices:
------------
/dev/mst/mt4115_pciconf0 - PCI configuration cycles access.
domain:bus:dev.fn=0000:05:00.0 addr.reg=88
data.reg=92
Chip revision is: 00
/dev/mst/mt4115_pciconf1 - PCI configuration cycles access.
domain:bus:dev.fn=0000:0c:00.0 addr.reg=88
data.reg=92
Chip revision is: 00
/dev/mst/mt4115_pciconf2 - PCI configuration cycles access.
domain:bus:dev.fn=0000:84:00.0 addr.reg=88
data.reg=92
Chip revision is: 00
/dev/mst/mt4115_pciconf3 - PCI configuration cycles access.
domain:bus:dev.fn=0000:8b:00.0 addr.reg=88
data.reg=92
Chip revision is: 00
Solution: IB/Ethernet mode switching
Set the 4th card to Ethernet mode
$ sudo mlxconfig -y -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=2
Device #4:
----------
Device type: ConnectX4
PCI device: /dev/mst/mt4115_pciconf0
Configurations: Next Boot New
LINK_TYPE_P1 IB(1) ETH(2)
Apply new Configuration? ? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
$ sudo reboot
(have a coffee)
$ ibv_devinfo | grep -e "hca_id\|link_layer"
hca_id: mlx5_3    link_layer: Ethernet
Solution: IB/Ethernet mode switching
Set the 4th card to InfiniBand mode
$ sudo mst start
$ sudo mlxconfig -y -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=1
Device #1:
----------
Device type: ConnectX4
PCI device: /dev/mst/mt4115_pciconf0
Configurations: Next Boot New
LINK_TYPE_P1 ETH(2) IB(1)
Apply new Configuration? ? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
$ sudo reboot
(have a coffee)
$ ibv_devinfo | grep -e "hca_id\|link_layer"
hca_id: mlx5_3    link_layer: InfiniBand
Bandwidth & Latency between nodes
Info: Bandwidth and Latency
ib_read_bw
This command is part of the Mellanox perftest package: https://community.mellanox.com/docs/DOC-2086. It is installed as part of the MLNX_OFED installation.
Example:
(server)
ib_read_bw -d mlx5_2
(client)
ib_read_bw -d mlx5_2 --report_gbits <server IP address>
ib_read_lat
This command measures the latency of RDMA read operations of a given message size between a pair of DGX-1s.
Example:
(server)
ib_read_lat -d mlx5_2
(client)
ib_read_lat -d mlx5_2 <server IP address>
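Running these tests requires starting the server side first, then the client on the second node. Here is a sketch of the pairing over ssh; the hostnames are placeholders, and the DRY_RUN guard and function name are mine:

```shell
# run_bw_test SERVER CLIENT DEVICE: pair ib_read_bw across two nodes.
# With DRY_RUN=1 the commands are printed instead of executed.
run_bw_test() {
    server=$1 client=$2 dev=$3
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "on $server: ib_read_bw -d $dev"
        echo "on $client: ib_read_bw -d $dev --report_gbits $server"
        return 0
    fi
    # Start the listener on the server node, then drive it from the client.
    ssh "$server" "ib_read_bw -d $dev" &
    sleep 2    # give the server side time to start listening
    ssh "$client" "ib_read_bw -d $dev --report_gbits $server"
    wait
}
```

Preview with `DRY_RUN=1 run_bw_test dgx-a dgx-b mlx5_0` (hostnames hypothetical); the same pattern applies to ib_read_lat.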
Team Challenge: Bandwidth and Latency
● What is the bandwidth between two DGX-1s?
● Is the bandwidth the same in both directions?
● What is the latency between two DGX-1s?
● Is the latency the same in both directions?
● How does latency compare to ICMP (“ping”)?
Solution: Bandwidth and Latency
● What is the bandwidth between two DGX-1s?
(server, to see the IP) $ ip addr
(server) $ ib_read_bw -d mlx5_0
(client) $ ib_read_bw --report_gbits <ip addr>
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF    Device : mlx5_3
Number of qps : 1    Transport type : IB
Connection type : RC    Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x23 QPN 0x2fe6 PSN 0xe11032 OUT 0x10 RKey 0x1bb7a3 VAddr 0x002aaaaab30000
remote address: LID 0x1b QPN 0x493d PSN 0x6afd86 OUT 0x10 RKey 0x09f4bc VAddr 0x002aaaaab30000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 2862.750000 != 1453.460000. CPU Frequency is not max.
65536 1000 95.35 95.31 0.181792
---------------------------------------------------------------------------------------
Solution: Bandwidth and Latency (contd.)
● What is the latency between two DGX-1s?
(server, to see the IP) $ ip addr
(server) $ ib_read_lat -d mlx5_0
(client) $ ib_read_lat <ip addr>
---------------------------------------------------------------------------------------
RDMA_Read Latency Test
Dual-port : OFF    Device : mlx5_3
Number of qps : 1    Transport type : IB
Connection type : RC    Using SRQ : OFF
TX depth : 1
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x23 QPN 0x2fe7 PSN 0xd8e1de OUT 0x10 RKey 0x1bae95 VAddr 0x002aaaaaad9000
remote address: LID 0x1b QPN 0x493e PSN 0x56549e OUT 0x10 RKey 0x0a60df VAddr 0x002aaaaaadb000
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
Conflicting CPU frequency values detected: 2011.625000 != 2624.273000. CPU Frequency is not max.
Conflicting CPU frequency values detected: 2671.710000 != 3597.257000. CPU Frequency is not max.
2 1000 2.35 17.08 2.40 2.43 0.55 2.48 17.08
---------------------------------------------------------------------------------------
Solution: Bandwidth and Latency (contd.)
● How does latency compare to ICMP (“ping”)?
(server, to see the ip) $ ip addr
(client) $ ping <ip addr>
PING 10.31.229.56 (10.31.229.56) 56(84) bytes of data.
64 bytes from 10.31.229.56: icmp_seq=1 ttl=64 time=0.306 ms
64 bytes from 10.31.229.56: icmp_seq=2 ttl=64 time=0.185 ms
64 bytes from 10.31.229.56: icmp_seq=3 ttl=64 time=0.285 ms
64 bytes from 10.31.229.56: icmp_seq=4 ttl=64 time=0.269 ms
64 bytes from 10.31.229.56: icmp_seq=5 ttl=64 time=0.241 ms
--- 10.31.229.56 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3998ms
rtt min/avg/max/mdev = 0.185/0.257/0.306/0.043 ms
~257 µs over Ethernet (ICMP) vs ~2.4 µs over IB: IB is roughly 100x faster
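The Ethernet figure above can be extracted directly from the ping summary line. A small sketch; the helper name is mine, and note ping reports round-trip time in milliseconds:

```shell
# rtt_avg_us: read a `ping` summary on stdin and print the average RTT
# converted to microseconds. The avg value is the 8th field when splitting
# the "rtt min/avg/max/mdev = a/b/c/d ms" line on '/' and spaces.
rtt_avg_us() {
    awk -F'[/ ]' '/^rtt/ { printf "%.0f\n", $8 * 1000 }'
}

# e.g.: ping -c 5 <ip addr> | rtt_avg_us
```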
Troubleshooting IB State
Using ibstat to troubleshoot connection states
CA 'mlx5_1'
CA type: MT4115
Number of ports: 1
Firmware version: 12.17.1010
Hardware version: 0
Node GUID: 0x248a0703000de26c
System image GUID: 0x248a0703000de26c
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2651e848
Port GUID: 0x248a0703000de26c
Link layer: InfiniBand
CA 'mlx5_2'
CA type: MT4115
Number of ports: 1
Firmware version: 12.17.1010
Hardware version: 0
Node GUID: 0x248a0703001effde
System image GUID: 0x248a0703001effde
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2651e848
Port GUID: 0x248a0703001effde
Link layer: InfiniBand
CA 'mlx5_3'
CA type: MT4115
Number of ports: 1
Firmware version: 12.17.1010
Hardware version: 0
Node GUID: 0x7cfe900300118f22
System image GUID: 0x7cfe900300118f22
Port 1:
State: Inactive
Physical state: LinkUp
Rate: 100
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2651e848
Port GUID: 0x7cfe900300118f22
Link layer: InfiniBand
Description of the ibstat output:

Physical State
The physical state field indicates the state of the cable. This is very similar to the link state on Ethernet.

Polling
There is no connection from this card to another card or switch. Check to make sure the cable is installed and the device on the other end of the cable is on and working properly.

LinkUp
There is link and connection between this node and the device at the other end of the cable. This doesn’t mean it’s configured and ready to send data, just that the physical connection is up.

State
The state shows if the HCA port is up, and if it’s been discovered by the subnet manager.

Down
There is no physical connection between the HCA card in this node and the device at the other end of the cable. This is almost always seen when ‘Physical State’ shows the value ‘Polling’.

Initializing
A physical connection has been made between the HCA in this node and the device at the other end of the cable, but it hasn’t been discovered by the subnet manager. Make sure you have a managed switch, or more likely that the ‘opensm‘ process is running on a node in your cluster.

Active
The physical connection is up and working, and the port has been discovered by the subnet manager. The port is in a normal operational state.

Rate
The rate is the speed at which the port is operating. This should match the speed of the slowest device between the node's HCA and the device at the other end of the cable.
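The state table above lends itself to an automated check. A sketch that scans captured `ibstat` output and flags any port that is not Active, with a hint drawn from the descriptions above (the helper name is mine):

```shell
# check_port_states: read `ibstat` output on stdin and print one line per
# port whose State is not Active, with a troubleshooting hint.
# Returns nonzero if any port needs attention.
check_port_states() {
    awk '
        # "CA <name>" lines have exactly two fields; "CA type: ..." has three.
        $1 == "CA" && NF == 2 { ca = $2 }
        $1 == "State:" && $2 != "Active" {
            hint = "check that opensm is running"
            if ($2 == "Down") hint = "check cabling / remote device"
            printf "%s: state %s (%s)\n", ca, $2, hint
            bad = 1
        }
        END { exit bad }'
}

# e.g.: ibstat | check_port_states && echo "all ports Active"
```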