Author: | Source: Internet | 2023-08-10 18:22
Environment
After so much theory, let's actually run an RDMA test. Our company happens to have some Mellanox NICs. The card is:
[root@localhost ~]# lspci -vvv |grep Eth
01:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
01:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
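Linux version
[root@localhost ~]# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
[root@localhost ~]# uname -r
3.10.0-1160.el7.x86_64
The firmware version is 14.23.1020, as the "Current FW version on flash" line of this flint output shows: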
[root@localhost bak]# flint -d /dev/mst/mt4117_pciconf0 -i fw-ConnectX4Lx-rel-14_31_1014-MCX4121A-ACA_Ax-UEFI-14.24.13-FlexBoot-3.6.403.bin burn
Current FW version on flash: 14.23.1020
New FW version: 14.31.1014
FSMST_INITIALIZE - OK
Writing Boot image component - OK
-I- To load new FW run mlxfwreset or reboot machine.
Installing OFED
The Mellanox OFED download page is:
https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/
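Download the version that matches your operating system, then unpack and install it:
tar xvf MLNX_OFED_SRC-5.5-1.0.3.2.tgz
cd MLNX_OFED_SRC-5.5-1.0.3.2/
./install.pl
After installation, the self test reports the GUIDs and a number of PASS results: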
[root@localhost MLNX_OFED_SRC-5.5-1.0.3.2]# hca_self_test.ofed
---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 2
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... OFED-internal-5.5-1.0.3: 3.10.0-1160.el7.x86_64
Host Driver RPM Check .................. PASS
Firmware on CA #0 NIC .................. v14.23.1020
Firmware on CA #1 NIC .................. v14.23.1020
Host Driver Initialization ............. PASS
Number of CA Ports Active .............. 0
Port State of Port #1 on CA #0 (NIC)..... DOWN (Ethernet)
Port State of Port #1 on CA #1 (NIC)..... DOWN (Ethernet)
Error Counter Check on CA #0 (NIC)...... PASS
Error Counter Check on CA #1 (NIC)...... PASS
Kernel Syslog Check .................... PASS
Node GUID on CA #0 (NIC) ............... 98:03:9b:03:00:48:bd:c8
Node GUID on CA #1 (NIC) ............... 98:03:9b:03:00:48:bd:c9
A few commands can be used to check the IB state:
[root@localhost MLNX_OFED_SRC-5.5-1.0.3.2]# ibdev2netdev # show the mapping between IB devices/ports and Ethernet devices
mlx5_0 port 1 ==> eth1 (Down)
mlx5_1 port 1 ==> eth2 (Down)
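To use this mapping in a script, the lines can be split into fields. The following is a minimal sketch that assumes the output format shown above; `down_netdevs` is a hypothetical helper name, not an OFED command.

```shell
#!/usr/bin/env bash
# down_netdevs (hypothetical helper) reads ibdev2netdev output on stdin and
# prints the name of every network interface whose port is reported as (Down).
# Assumes lines shaped like: "mlx5_0 port 1 ==> eth1 (Down)".
down_netdevs() {
    awk '$4 == "==>" && $6 == "(Down)" { print $5 }'
}

# Usage on a host with OFED installed:
#   ibdev2netdev | down_netdevs
```

A port can be reported as Down simply because no cable is attached or the interface has not been brought up (for example with `ip link set eth1 up`), so this is only a first check.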
[root@localhost MLNX_OFED_SRC-5.5-1.0.3.2]# ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0) //IB protocol
fw_ver: 14.23.1020
node_guid: 9803:9b03:0048:bdc8
sys_image_guid: 9803:9b03:0048:bdc8
vendor_id: 0x02c9
vendor_part_id: 4117
hw_ver: 0x0
board_id: MT_2420110034
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 14.23.1020
node_guid: 9803:9b03:0048:bdc9
sys_image_guid: 9803:9b03:0048:bdc8
vendor_id: 0x02c9
vendor_part_id: 4117
hw_ver: 0x0
board_id: MT_2420110034
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
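The fields that matter in this output are state and link_layer. As a rough sketch, assuming the field names shown above, they can be condensed to one line per port; `summarize_devinfo` is a hypothetical helper name and not part of OFED.

```shell
#!/usr/bin/env bash
# summarize_devinfo (hypothetical helper) reads ibv_devinfo output on stdin
# and prints "<device> <port state> <link layer>" for each port.
summarize_devinfo() {
    awk '
        $1 == "hca_id:"     { dev = $2 }             # remember current device
        $1 == "state:"      { state = $2 }           # e.g. PORT_DOWN
        $1 == "link_layer:" { print dev, state, $2 } # last field of a port block
    '
}

# Usage on a host with OFED installed:
#   ibv_devinfo | summarize_devinfo
```

For the output above this would give one line for mlx5_0 and one for mlx5_1, both with PORT_DOWN and Ethernet.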
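Judging from the output above, state is still PORT_DOWN and link_layer is Ethernet rather than IB. According to what I found online, LINK_TYPE_P1 has to be set to 1 (1 is IB mode, 2 is Ethernet mode).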
[root@localhost ~]# mlxconfig -d /dev/mst/mt4117_pciconf0 query |grep LINK
But the query returned no LINK_TYPE_P1 option.
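The same check can be wrapped in a small function so the result is explicit instead of an empty grep. This is only a sketch; `has_link_type` is a hypothetical helper name, and the device path is the one used in this post.

```shell
#!/usr/bin/env bash
# has_link_type (hypothetical helper) reads `mlxconfig ... query` output on
# stdin and reports whether a LINK_TYPE_P1 parameter is listed at all.
has_link_type() {
    if grep -q 'LINK_TYPE_P1'; then
        echo "configurable"
    else
        echo "not-configurable"
    fi
}

# Usage:
#   mlxconfig -d /dev/mst/mt4117_pciconf0 query | has_link_type
# On a card that does list the parameter, port 1 would be switched with:
#   mlxconfig -d <device> set LINK_TYPE_P1=1    # 1 = IB, 2 = Ethernet
```

On the card in this post the function would print "not-configurable", which matches the empty grep result.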
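I suspected the firmware version might be the problem.
Trying a firmware update
After some searching, it turns out the MFT (Mellanox Firmware Tools) package is needed; it provides the mst tool: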
https://network.nvidia.com/products/adapter-software/firmware-tools/
tar xvf mft-4.18.0-106-x86_64-rpm.tgz
cd mft-4.18.0-106-x86_64-rpm/
./install.sh
mst start
service mst status
Download the latest firmware:
https://network.nvidia.com/support/firmware/connectx4lxen/
[root@localhost bak]# flint -d /dev/mst/mt4117_pciconf0 -i fw-ConnectX4Lx-rel-14_31_1014-MCX4121A-ACA_Ax-UEFI-14.24.13-FlexBoot-3.6.403.bin burn
Current FW version on flash: 14.23.1020
New FW version: 14.31.1014
FSMST_INITIALIZE - OK
Writing Boot image component - OK
-I- To load new FW run mlxfwreset or reboot machine.
It had no effect.
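As flint's last line says, the new image only becomes active after mlxfwreset or a reboot, so it is worth confirming which version is actually running before drawing conclusions. A minimal sketch for comparing the two version strings, assuming GNU `sort -V` is available (it is on CentOS 7); `fw_is_older` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# fw_is_older A B (hypothetical helper): exit status 0 when version A sorts
# strictly before version B, non-zero otherwise. Relies on GNU `sort -V`.
fw_is_older() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Usage, with the versions from the flint output above:
#   if fw_is_older 14.23.1020 14.31.1014; then
#       echo "running firmware is older than the flashed image"
#   fi
# To load the flashed image without a reboot:
#   mlxfwreset -d /dev/mst/mt4117_pciconf0 reset
# Then re-check the running version:
#   ibv_devinfo | grep fw_ver
```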
I also downloaded an older driver, 5.1, to replace 5.5, and that did not work either.
Later I found the following at this URL:
https://access.redhat.com/articles/3082811
Note that the card in the example output is an Ethernet-only card, so there is no port type setting.
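In other words, the ConnectX-4 Lx is an Ethernet-only card and does not support IB. But then why does ibv_devinfo report the transport as InfiniBand? Very strange.
It looks like an IB-mode test is not possible on this card.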
transport: InfiniBand (0)
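Besides, the ConnectX-4 Lx and the ConnectX-4 both use the mlx5 driver, so one would expect native IB support. Why make a board without it? (The card is Ethernet-only at the link layer, so RDMA on it would have to run over RoCE rather than native IB.)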
https://mymellanox.force.com/mellanoxcommunity/s/question/0D51T00008dGyJMSA0/how-to-use-mellanox-connectx4-lx
This page says the same thing:
Unfortunately, I'm starting to think that I have the wrong card (and that this only works for Ethernet), because I am unable to change the link type of this card to infiniband. I have followed all the instructions, but it says that the option (LINK_TYPE) isn't found when I try via the command line.