I have a few servers with several NICs, and I have installed Open MPI (ompi), UCX, and osu-micro-benchmarks. I am running the following command:
mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=mlx5_1:1 -x UCX_TLS=self,sm,rc_v -x UCX_IB_GID_INDEX=3 -hostfile hosts /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
It runs fine on my other NIC setups, but on my current setup it fails with this error:
ud_ep.c:278 Fatal: UD endpoint 0x22fe520 to <no debug data>: unhandled timeout error
==== backtrace (tid: 4061) ====
0 /root/ucx/ucx_install/lib/libucs.so.0(ucs_handle_error+0x294) [0x7fdb4153bda4]
1 /root/ucx/ucx_install/lib/libucs.so.0(ucs_fatal_error_message+0xb2) [0x7fdb41539162]
2 /root/ucx/ucx_install/lib/libucs.so.0(+0x2a239) [0x7fdb41539239]
3 /root/ucx/ucx_install/lib/ucx/libuct_ib.so.0(+0x5e050) [0x7fdb41493050]
4 /root/ucx/ucx_install/lib/libucs.so.0(+0x21467) [0x7fdb41530467]
5 /root/ucx/ucx_install/lib/libucp.so.0(ucp_worker_progress+0x2a) [0x7fdb415ee91a]
6 /root/ompi/ompi_install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x157) [0x7fdb41683df7]
7 /root/ompi/ompi_install/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0xbb) [0x7fdb438d17fb]
8 /root/ompi/ompi_install/lib/libmpi.so.40(MPI_Barrier+0xa8) [0x7fdb43882ea8]
9 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x4027fe]
10 /lib64/libc.so.6(+0x44e50) [0x7fdb4332de50]
11 /lib64/libc.so.6(__libc_start_main+0x7c) [0x7fdb4332defc]
12 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x4031b5]
=================================
[tst-srv-193:04061] *** Process received signal ***
[tst-srv-193:04061] Signal: Aborted (6)
[tst-srv-193:04061] Signal code: (-6)
[tst-srv-193:04061] [ 0] /lib64/libc.so.6(+0x59db0)[0x7fdb43342db0]
[tst-srv-193:04061] [ 1] /lib64/libc.so.6(+0xa642c)[0x7fdb4338f42c]
[tst-srv-193:04061] [ 2] /lib64/libc.so.6(raise+0x16)[0x7fdb43342d06]
[tst-srv-193:04061] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7fdb433157d3]
[tst-srv-193:04061] [ 4] /root/ucx/ucx_install/lib/libucs.so.0(+0x2a167)[0x7fdb41539167]
[tst-srv-193:04061] [ 5] /root/ucx/ucx_install/lib/libucs.so.0(+0x2a239)[0x7fdb41539239]
[tst-srv-193:04061] [ 6] /root/ucx/ucx_install/lib/ucx/libuct_ib.so.0(+0x5e050)[0x7fdb41493050]
[tst-srv-193:04061] [ 7] /root/ucx/ucx_install/lib/libucs.so.0(+0x21467)[0x7fdb41530467]
[tst-srv-193:04061] [ 8] /root/ucx/ucx_install/lib/libucp.so.0(ucp_worker_progress+0x2a)[0x7fdb415ee91a]
[tst-srv-193:04061] [ 9] /root/ompi/ompi_install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x157)[0x7fdb41683df7]
[tst-srv-193:04061] [10] /root/ompi/ompi_install/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0xbb)[0x7fdb438d17fb]
[tst-srv-193:04061] [11] /root/ompi/ompi_install/lib/libmpi.so.40(MPI_Barrier+0xa8)[0x7fdb43882ea8]
[tst-srv-193:04061] [12] /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x4027fe]
[tst-srv-193:04061] [13] /lib64/libc.so.6(+0x44e50)[0x7fdb4332de50]
[tst-srv-193:04061] [14] /lib64/libc.so.6(__libc_start_main+0x7c)[0x7fdb4332defc]
[tst-srv-193:04061] [15] /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x4031b5]
[tst-srv-193:04061] *** End of error message ***
[tst-srv-192:3952 :0:3952] ud_ep.c:278 Fatal: UD endpoint 0x24344f0 to <no debug data>: unhandled timeout error
==== backtrace (tid: 3952) ====
0 /root/ucx/ucx_install/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f2892c6ada4]
1 /root/ucx/ucx_install/lib/libucs.so.0(ucs_fatal_error_message+0xb2) [0x7f2892c68162]
2 /root/ucx/ucx_install/lib/libucs.so.0(+0x2a239) [0x7f2892c68239]
3 /root/ucx/ucx_install/lib/ucx/libuct_ib.so.0(+0x5e050) [0x7f2892bc2050]
4 /root/ucx/ucx_install/lib/libucs.so.0(+0x21467) [0x7f2892c5f467]
5 /root/ucx/ucx_install/lib/libucp.so.0(ucp_worker_progress+0x2a) [0x7f2892d1d91a]
6 /root/ompi/ompi_install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x157) [0x7f2892db2df7]
7 /root/ompi/ompi_install/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0xbb) [0x7f2898ffe7fb]
8 /root/ompi/ompi_install/lib/libmpi.so.40(MPI_Barrier+0xa8) [0x7f2898fafea8]
9 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x4027fe]
10 /lib64/libc.so.6(+0x44e50) [0x7f2898a5ae50]
11 /lib64/libc.so.6(__libc_start_main+0x7c) [0x7f2898a5aefc]
12 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x4031b5]
=================================
[tst-srv-192:03952] *** Process received signal ***
[tst-srv-192:03952] Signal: Aborted (6)
[tst-srv-192:03952] Signal code: (-6)
[tst-srv-192:03952] [ 0] /lib64/libc.so.6(+0x59db0)[0x7f2898a6fdb0]
[tst-srv-192:03952] [ 1] /lib64/libc.so.6(+0xa642c)[0x7f2898abc42c]
[tst-srv-192:03952] [ 2] /lib64/libc.so.6(raise+0x16)[0x7f2898a6fd06]
[tst-srv-192:03952] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7f2898a427d3]
[tst-srv-192:03952] [ 4] /root/ucx/ucx_install/lib/libucs.so.0(+0x2a167)[0x7f2892c68167]
[tst-srv-192:03952] [ 5] /root/ucx/ucx_install/lib/libucs.so.0(+0x2a239)[0x7f2892c68239]
[tst-srv-192:03952] [ 6] /root/ucx/ucx_install/lib/ucx/libuct_ib.so.0(+0x5e050)[0x7f2892bc2050]
[tst-srv-192:03952] [ 7] /root/ucx/ucx_install/lib/libucs.so.0(+0x21467)[0x7f2892c5f467]
[tst-srv-192:03952] [ 8] /root/ucx/ucx_install/lib/libucp.so.0(ucp_worker_progress+0x2a)[0x7f2892d1d91a]
[tst-srv-192:03952] [ 9] /root/ompi/ompi_install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x157)[0x7f2892db2df7]
[tst-srv-192:03952] [10] /root/ompi/ompi_install/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0xbb)[0x7f2898ffe7fb]
[tst-srv-192:03952] [11] /root/ompi/ompi_install/lib/libmpi.so.40(MPI_Barrier+0xa8)[0x7f2898fafea8]
[tst-srv-192:03952] [12] /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x4027fe]
[tst-srv-192:03952] [13] /lib64/libc.so.6(+0x44e50)[0x7f2898a5ae50]
[tst-srv-192:03952] [14] /lib64/libc.so.6(__libc_start_main+0x7c)[0x7f2898a5aefc]
[tst-srv-192:03952] [15] /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x4031b5]
[tst-srv-192:03952] *** End of error message ***
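Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
The Open MPI, UCX, and OSU micro-benchmarks installations all look fine to me. I tried removing UCX_NET_DEVICES so that UCX would pick the first NIC automatically, but that did not help. I am looking for any pointers on how to resolve this.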
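The two invocations described above (with and without UCX_NET_DEVICES) can be generated side by side for comparison. This is only a minimal sketch: the helper name `build_mpirun` is hypothetical, and the hostfile name `hosts` is an assumption. The flags and values are copied from the command in the post.

```python
import shlex

# Benchmark path taken from the command in the post.
BENCHMARK = "/usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency"


def build_mpirun(ucx_env, hostfile="hosts", benchmark=BENCHMARK):
    """Return the mpirun argument list for the given UCX settings.

    `hostfile="hosts"` is an assumed file name, not confirmed by the post.
    """
    cmd = [
        "mpirun",
        "--mca", "pml", "ucx",
        "--mca", "osc", "ucx",
        "--mca", "spml", "ucx",
        "--mca", "btl", "^vader,tcp,openib,uct",
    ]
    # Each UCX variable is exported to the ranks with -x NAME=VALUE.
    for name, value in ucx_env.items():
        cmd += ["-x", f"{name}={value}"]
    cmd += ["-hostfile", hostfile, benchmark]
    return cmd


# Settings from the original command.
original = {
    "UCX_NET_DEVICES": "mlx5_1:1",
    "UCX_TLS": "self,sm,rc_v",
    "UCX_IB_GID_INDEX": "3",
}

# Variant with UCX_NET_DEVICES removed, leaving device selection to UCX.
no_device = {k: v for k, v in original.items() if k != "UCX_NET_DEVICES"}

if __name__ == "__main__":
    # Print shell-quoted command lines; nothing is executed here.
    print(shlex.join(build_mpirun(original)))
    print(shlex.join(build_mpirun(no_device)))
```

Printing the commands instead of running them makes it easy to diff the two variants before launching either one on the cluster.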
I am hitting the same kind of error. Did you find a solution to this problem, @Puneet?