To learn RDMA, I found an example online, similar to the one MELLANOX provides. But when I ran it on two machines, I hit the following problems:
1. The bandwidth measured by the example code differs greatly from the bandwidth measured by perftest.
2. In addition, using GID 0 or 2 on one of the two machines reduces the bandwidth significantly.
Machine A:
Configuration:
hca_id: mlx5_bond_0
transport: InfiniBand (0)
fw_ver: 20.39.3004
node_guid: 1070:fd03:00e5:f118
sys_image_guid: 1070:fd03:00e5:f118
vendor_id: 0x02c9
vendor_part_id: 4123
hw_ver: 0x0
board_id: MT_0000000224
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_bond_0 1 0 fe80:0000:0000:0000:b0fc:4eff:feb3:1112 v1 bond0
mlx5_bond_0 1 1 fe80:0000:0000:0000:b0fc:4eff:feb3:1112 v2 bond0
mlx5_bond_0 1 2 0000:0000:0000:0000:0000:ffff:0a77:2e3d 10.119.46.61 v1 bond0
mlx5_bond_0 1 3 0000:0000:0000:0000:0000:ffff:0a77:2e3d 10.119.46.61 v2 bond0
perftest run on GID 1:
---------------------------------------------------------------------------------------
RDMA_Read BW Test
RX depth: 1
post_list: 1
inline_size: 0
Dual-port : OFF Device : mlx5_bond_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 1
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x1659 PSN 0xd4858a OUT 0x10 RKey 0x203e00 VAddr 0x007f38d0d07000
GID: 254:128:00:00:00:00:00:00:176:252:78:255:254:179:17:18
remote address: LID 0000 QPN 0x1c86 PSN 0xc2e51a OUT 0x10 RKey 0x013f00 VAddr 0x007f123fc62000
GID: 254:128:00:00:00:00:00:00:100:155:154:255:254:172:09:41
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
65536 1000 10829.53 10829.17 0.173267
---------------------------------------------------------------------------------------
Machine B:
hca_id: mlx5_bond_0
transport: InfiniBand (0)
fw_ver: 20.39.3004
node_guid: e8eb:d303:0032:b212
sys_image_guid: e8eb:d303:0032:b212
vendor_id: 0x02c9
vendor_part_id: 4123
hw_ver: 0x0
board_id: MT_0000000224
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_bond_0 1 0 fe80:0000:0000:0000:649b:9aff:feac:0929 v1 bond0
mlx5_bond_0 1 1 fe80:0000:0000:0000:649b:9aff:feac:0929 v2 bond0
mlx5_bond_0 1 2 0000:0000:0000:0000:0000:ffff:0a77:2e3e 10.119.46.62 v1 bond0
mlx5_bond_0 1 3 0000:0000:0000:0000:0000:ffff:0a77:2e3e 10.119.46.62 v2 bond0
n_gids_found=4
perftest run on GID 0:
RDMA_Read BW Test
RX depth: 1
post_list: 1
inline_size: 0
Dual-port : OFF Device : mlx5_bond_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 1
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x1659 PSN 0xd4858a OUT 0x10 RKey 0x203e00 VAddr 0x007f38d0d07000
GID: 254:128:00:00:00:00:00:00:176:252:78:255:254:179:17:18
remote address: LID 0000 QPN 0x1c86 PSN 0xc2e51a OUT 0x10 RKey 0x013f00 VAddr 0x007f123fc62000
GID: 254:128:00:00:00:00:00:00:100:155:154:255:254:172:09:41
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
65536 1000 10829.53 10829.17 0.173267
---------------------------------------------------------------------------------------
When I test the example code: with M1 on GID 0 and M2 on GID 0 or GID 1, the bandwidth is about 0.0124 GB/s; with M1 on GID 1 and M2 on GID 1, it is about 6 GB/s. I would like to know what optimizations the perftest code makes, or what flaws in the example code cause such a large bandwidth gap.
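One relevant difference visible in the perftest output above is "Outstand reads : 16": perftest keeps many RDMA reads in flight at once, while many tutorial examples post one work request and wait for its completion before posting the next, so every message pays a full round trip. Here is a back-of-the-envelope model of that effect; it is only a sketch, and the RTT and line-rate numbers are assumptions, not measurements from these machines.

```python
# Toy pipelining model (NOT perftest's actual implementation).
# Assumed numbers: 10 us round-trip time, 100 Gb/s link.

def effective_bw(msg_bytes, outstanding, rtt_s=10e-6, line_rate_bps=100e9):
    """Effective bandwidth in GB/s with `outstanding` in-flight reads."""
    wire_time = msg_bytes * 8 / line_rate_bps          # serialization time
    window = rtt_s + wire_time                         # one pipeline round
    # In steady state, up to `outstanding` messages complete per window,
    # but throughput can never exceed what the link can carry in that window.
    bytes_per_window = min(outstanding * msg_bytes,
                           line_rate_bps / 8 * window)
    return bytes_per_window / window / 1e9

# With one outstanding read the RTT dominates; with 16 the link saturates.
print(f"1 outstanding : {effective_bw(65536, 1):.2f} GB/s")
print(f"16 outstanding: {effective_bw(65536, 16):.2f} GB/s")
```

With these assumed numbers, a single outstanding 64 KiB read reaches only a fraction of line rate, while 16 outstanding reads saturate the link.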
The reason is that the example code cannot unlock the hardware's full capability: the messages are too small! For small messages the fixed per-message setup cost is not negligible, whereas large messages are big enough to amortize that cost and sustain full-speed transmission.
Try the perftest program with the `--all` parameter and look at the speed difference between messages of 2 bytes and 2^23 bytes.
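The effect `--all` reveals can be sketched with a simple model: effective bandwidth = size / (fixed overhead + size / line rate). The overhead and line-rate values below are assumed for illustration, not measured from the hardware above.

```python
# Toy model of per-message overhead vs payload size (assumed numbers).
OVERHEAD_S = 2e-6     # assumed fixed per-message cost (WR setup, doorbell, completion)
LINE_RATE = 12.5e9    # ~100 Gb/s link, in bytes/s

def bw_gbs(size):
    """Effective bandwidth in GB/s for a message of `size` bytes."""
    return size / (OVERHEAD_S + size / LINE_RATE) / 1e9

# Sweep message sizes the way perftest --all does, from 2 B to 2^23 B.
for exp in range(1, 24, 2):
    size = 2 ** exp
    print(f"{size:>8} B -> {bw_gbs(size):6.3f} GB/s")
```

With these numbers, 2-byte messages achieve on the order of 0.001 GB/s while 2^23-byte messages approach line rate, mirroring the orders-of-magnitude spread `--all` shows on real hardware.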