Getting "ud_ep.c:278 Fatal: UD endpoint 0x22fe520 to <no debug data>: unhandled timeout error" when running the OSU micro-benchmarks with Open MPI and UCX

Problem description

I have a few servers, each with several NICs, on which I installed Open MPI, UCX, and the OSU micro-benchmarks. I am running the following command:

        mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=mlx5_1:1 -x UCX_TLS=self,sm,rc_v -x UCX_IB_GID_INDEX=3 -hostfile hosts /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
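To make it easier to vary one setting at a time while debugging, the command above can be assembled from its parts. This is only a minimal sketch: the helper `build_mpirun_command` is hypothetical (it is not part of Open MPI or UCX), and the hostfile name `hosts` is assumed from the command line in the question. Setting a UCX variable to `None` drops it from the command, which corresponds to letting UCX pick a value itself.

```python
import shlex


def build_mpirun_command(hostfile, benchmark, mca_params, ucx_env):
    """Assemble an mpirun argument list (hypothetical helper, for illustration)."""
    argv = ["mpirun"]
    # Each MCA parameter becomes "--mca <name> <value>".
    for name, value in mca_params.items():
        argv += ["--mca", name, value]
    # Each UCX variable is exported to the ranks with "-x NAME=value".
    # A value of None means "do not pass it", so UCX falls back to its default.
    for name, value in ucx_env.items():
        if value is not None:
            argv += ["-x", f"{name}={value}"]
    argv += ["-hostfile", hostfile, benchmark]
    return argv


# Values taken from the command in the question.
mca = {"pml": "ucx", "osc": "ucx", "spml": "ucx", "btl": "^vader,tcp,openib,uct"}
ucx = {
    "UCX_NET_DEVICES": "mlx5_1:1",
    "UCX_TLS": "self,sm,rc_v",
    "UCX_IB_GID_INDEX": "3",
}
benchmark = "/usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency"

cmd = build_mpirun_command("hosts", benchmark, mca, ucx)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal

# Same run without pinning the device, as tried later in the question.
cmd_auto = build_mpirun_command("hosts", benchmark, mca, dict(ucx, UCX_NET_DEVICES=None))
print(shlex.join(cmd_auto))
```

Keeping the MCA parameters and the UCX variables in separate dictionaries makes it obvious which layer a given change affects when comparing the working and failing setups.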

This works fine on my other NIC setup, but on my current setup it fails with the following error:

        ud_ep.c:278  Fatal: UD endpoint 0x22fe520 to <no debug data>: unhandled timeout error
        ==== backtrace (tid:   4061) ====
         0  /root/ucx/ucx_install/lib/libucs.so.0(ucs_handle_error+0x294) [0x7fdb4153bda4]
         1  /root/ucx/ucx_install/lib/libucs.so.0(ucs_fatal_error_message+0xb2) [0x7fdb41539162]
         2  /root/ucx/ucx_install/lib/libucs.so.0(+0x2a239) [0x7fdb41539239]
         3  /root/ucx/ucx_install/lib/ucx/libuct_ib.so.0(+0x5e050) [0x7fdb41493050]
         4  /root/ucx/ucx_install/lib/libucs.so.0(+0x21467) [0x7fdb41530467]
         5  /root/ucx/ucx_install/lib/libucp.so.0(ucp_worker_progress+0x2a) [0x7fdb415ee91a]
         6  /root/ompi/ompi_install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x157) [0x7fdb41683df7]
         7  /root/ompi/ompi_install/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0xbb) [0x7fdb438d17fb]
         8  /root/ompi/ompi_install/lib/libmpi.so.40(MPI_Barrier+0xa8) [0x7fdb43882ea8]
         9  /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x4027fe]
        10  /lib64/libc.so.6(+0x44e50) [0x7fdb4332de50]
        11  /lib64/libc.so.6(__libc_start_main+0x7c) [0x7fdb4332defc]
        12  /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x4031b5]
        =================================
        [tst-srv-193:04061] *** Process received signal ***
        [tst-srv-193:04061] Signal: Aborted (6)
        [tst-srv-193:04061] Signal code:  (-6)
        [tst-srv-193:04061] [ 0] /lib64/libc.so.6(+0x59db0)[0x7fdb43342db0]
        [tst-srv-193:04061] [ 1] /lib64/libc.so.6(+0xa642c)[0x7fdb4338f42c]
        [tst-srv-193:04061] [ 2] /lib64/libc.so.6(raise+0x16)[0x7fdb43342d06]
        [tst-srv-193:04061] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7fdb433157d3]
        [tst-srv-193:04061] [ 4] /root/ucx/ucx_install/lib/libucs.so.0(+0x2a167)[0x7fdb41539167]
        [tst-srv-193:04061] [ 5] /root/ucx/ucx_install/lib/libucs.so.0(+0x2a239)[0x7fdb41539239]
        [tst-srv-193:04061] [ 6] /root/ucx/ucx_install/lib/ucx/libuct_ib.so.0(+0x5e050)[0x7fdb41493050]
        [tst-srv-193:04061] [ 7] /root/ucx/ucx_install/lib/libucs.so.0(+0x21467)[0x7fdb41530467]
        [tst-srv-193:04061] [ 8] /root/ucx/ucx_install/lib/libucp.so.0(ucp_worker_progress+0x2a)[0x7fdb415ee91a]
        [tst-srv-193:04061] [ 9] /root/ompi/ompi_install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x157)[0x7fdb41683df7]
        [tst-srv-193:04061] [10] /root/ompi/ompi_install/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0xbb)[0x7fdb438d17fb]
        [tst-srv-193:04061] [11] /root/ompi/ompi_install/lib/libmpi.so.40(MPI_Barrier+0xa8)[0x7fdb43882ea8]
        [tst-srv-193:04061] [12] /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x4027fe]
        [tst-srv-193:04061] [13] /lib64/libc.so.6(+0x44e50)[0x7fdb4332de50]
        [tst-srv-193:04061] [14] /lib64/libc.so.6(__libc_start_main+0x7c)[0x7fdb4332defc]
        [tst-srv-193:04061] [15] /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x4031b5]
        [tst-srv-193:04061] *** End of error message ***
        [tst-srv-192:3952 :0:3952]       ud_ep.c:278  Fatal: UD endpoint 0x24344f0 to <no debug data>: unhandled timeout error
        ==== backtrace (tid:   3952) ====
         0  /root/ucx/ucx_install/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f2892c6ada4]
         1  /root/ucx/ucx_install/lib/libucs.so.0(ucs_fatal_error_message+0xb2) [0x7f2892c68162]
         2  /root/ucx/ucx_install/lib/libucs.so.0(+0x2a239) [0x7f2892c68239]
         3  /root/ucx/ucx_install/lib/ucx/libuct_ib.so.0(+0x5e050) [0x7f2892bc2050]
         4  /root/ucx/ucx_install/lib/libucs.so.0(+0x21467) [0x7f2892c5f467]
         5  /root/ucx/ucx_install/lib/libucp.so.0(ucp_worker_progress+0x2a) [0x7f2892d1d91a]
         6  /root/ompi/ompi_install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x157) [0x7f2892db2df7]
         7  /root/ompi/ompi_install/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0xbb) [0x7f2898ffe7fb]
         8  /root/ompi/ompi_install/lib/libmpi.so.40(MPI_Barrier+0xa8) [0x7f2898fafea8]
         9  /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x4027fe]
        10  /lib64/libc.so.6(+0x44e50) [0x7f2898a5ae50]
        11  /lib64/libc.so.6(__libc_start_main+0x7c) [0x7f2898a5aefc]
        12  /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x4031b5]
        =================================
        [tst-srv-192:03952] *** Process received signal ***
        [tst-srv-192:03952] Signal: Aborted (6)
        [tst-srv-192:03952] Signal code:  (-6)
        [tst-srv-192:03952] [ 0] /lib64/libc.so.6(+0x59db0)[0x7f2898a6fdb0]
        [tst-srv-192:03952] [ 1] /lib64/libc.so.6(+0xa642c)[0x7f2898abc42c]
        [tst-srv-192:03952] [ 2] /lib64/libc.so.6(raise+0x16)[0x7f2898a6fd06]
        [tst-srv-192:03952] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7f2898a427d3]
        [tst-srv-192:03952] [ 4] /root/ucx/ucx_install/lib/libucs.so.0(+0x2a167)[0x7f2892c68167]
        [tst-srv-192:03952] [ 5] /root/ucx/ucx_install/lib/libucs.so.0(+0x2a239)[0x7f2892c68239]
        [tst-srv-192:03952] [ 6] /root/ucx/ucx_install/lib/ucx/libuct_ib.so.0(+0x5e050)[0x7f2892bc2050]
        [tst-srv-192:03952] [ 7] /root/ucx/ucx_install/lib/libucs.so.0(+0x21467)[0x7f2892c5f467]
        [tst-srv-192:03952] [ 8] /root/ucx/ucx_install/lib/libucp.so.0(ucp_worker_progress+0x2a)[0x7f2892d1d91a]
        [tst-srv-192:03952] [ 9] /root/ompi/ompi_install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x157)[0x7f2892db2df7]
        [tst-srv-192:03952] [10] /root/ompi/ompi_install/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0xbb)[0x7f2898ffe7fb]
        [tst-srv-192:03952] [11] /root/ompi/ompi_install/lib/libmpi.so.40(MPI_Barrier+0xa8)[0x7f2898fafea8]
        [tst-srv-192:03952] [12] /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x4027fe]
        [tst-srv-192:03952] [13] /lib64/libc.so.6(+0x44e50)[0x7f2898a5ae50]
        [tst-srv-192:03952] [14] /lib64/libc.so.6(__libc_start_main+0x7c)[0x7f2898a5aefc]
        [tst-srv-192:03952] [15] /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x4031b5]
        [tst-srv-192:03952] *** End of error message ***

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
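Long backtraces like the one above are easier to compare across ranks once they are reduced to the fatal message and the named frames. The following is a rough sketch that assumes the log layout shown in this question (the regular expressions and the `summarize` helper are my own, not a UCX tool); it only looks at the UCX-style frames and ignores the second, signal-handler copy of the stack.

```python
import re

# "ud_ep.c:278  Fatal: <message>" as printed in the log above.
FATAL_RE = re.compile(r"(?P<file>\w+\.c):(?P<line>\d+)\s+Fatal:\s+(?P<message>.+)")
# " 5  /path/lib.so(symbol+0x2a) [0xaddr]" frames of the UCX backtrace.
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<object>\S+?)\((?P<symbol>[^)]*)\)\s+\[(?P<addr>0x[0-9a-f]+)\]"
)


def summarize(log_text):
    """Return (fatal messages, named symbols) found in a log excerpt."""
    fatals = [m.groupdict() for m in FATAL_RE.finditer(log_text)]
    symbols = []
    for line in log_text.splitlines():
        m = FRAME_RE.match(line)
        # Skip frames with no symbol name, e.g. "()" or "(+0x2a239)".
        if m and m["symbol"] and not m["symbol"].startswith("+"):
            symbols.append(m["symbol"].split("+")[0])
    return fatals, symbols


# A few lines copied from the output above.
excerpt = """\
        ud_ep.c:278  Fatal: UD endpoint 0x22fe520 to <no debug data>: unhandled timeout error
         2  /root/ucx/ucx_install/lib/libucs.so.0(+0x2a239) [0x7fdb41539239]
         5  /root/ucx/ucx_install/lib/libucp.so.0(ucp_worker_progress+0x2a) [0x7fdb415ee91a]
         6  /root/ompi/ompi_install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x157) [0x7fdb41683df7]
         8  /root/ompi/ompi_install/lib/libmpi.so.40(MPI_Barrier+0xa8) [0x7fdb43882ea8]
         9  /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x4027fe]
"""

fatals, symbols = summarize(excerpt)
print(fatals[0]["file"], fatals[0]["line"], fatals[0]["message"])
print(symbols)
```

Run on the full output, this shows that both ranks (on tst-srv-193 and tst-srv-192) abort at the same place: inside `MPI_Barrier`, while `ucp_worker_progress` is driving the UD endpoint.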

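The installation of Open MPI, UCX, and the OSU micro-benchmarks looks fine to me. I tried removing the UCX_NET_DEVICES setting so that UCX would pick the first NIC automatically, but that did not help. I am looking for any pointers on how to resolve this.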

nvidia openmpi rdma ucx
1 answer

I ran into the same kind of error. Did you find a solution to this problem, @Puneet?
