Intel MPI 2021.1 error on RHEL 8: integer divide by zero


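After upgrading the compute nodes of an HPC cluster from CentOS 7 to a RHEL 8 Linux distribution (KeyarchOS 5.8), I received reports that some software compiled with Intel OneAPI 2021.1 can no longer be run with mpirun. A typical error looks like this: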

[cu345:1485183:0:1485183] Caught signal 8 (Floating point exception: integer divide by zero)
[cu345:1485184:0:1485184] Caught signal 8 (Floating point exception: integer divide by zero)
[cu345:1485185:0:1485185] Caught signal 8 (Floating point exception: integer divide by zero)
[cu345:1485186:0:1485186] Caught signal 8 (Floating point exception: integer divide by zero)
==== backtrace (tid:1485126) ====
 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x0000000000b696ed next_random()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_types.h:1809
 2 0x0000000000b696ed impi_bcast_intra_huge()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_bcast.h:667
 3 0x0000000000b6630d impi_bcast_intra_heap()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_bcast.h:798
 4 0x000000000018ef6d MPIDI_POSIX_mpi_bcast()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/intel/posix_coll.h:124
 5 0x000000000017335e MPIDI_SHM_mpi_bcast()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_coll.h:39
 6 0x000000000017335e MPIDI_Bcast_intra_composition_alpha()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:303
 7 0x000000000017335e MPID_Bcast_invoke()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1726
 8 0x000000000017335e MPIDI_coll_invoke()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3356
 9 0x0000000000153bee MPIDI_coll_select()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:129
10 0x000000000021c02d MPID_Bcast()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:51
11 0x00000000001386e9 PMPI_Bcast()  /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/bcast/bcast.c:416

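When I tried to reproduce the problem, I found that the test passes only about 20% of the time. On CentOS 7 the same test passes 100% of the time. I have no idea what is going wrong.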
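A pass rate like the one above can be measured by running the same command repeatedly and counting the zero exit codes. The following is only a minimal sketch: the helper name pass_rate, the run count, and the test binary ./bcast_test are hypothetical, and it assumes that mpirun exits with a non-zero status when a rank crashes.

```shell
# pass_rate RUNS COMMAND [ARGS...]
# Runs COMMAND the given number of times and prints "passed/runs".
# The function body is a subshell, so its variables do not leak out.
pass_rate() (
    runs=$1
    shift
    passed=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # A run counts as passed when the command exits with status 0.
        if "$@" >/dev/null 2>&1; then
            passed=$((passed + 1))
        fi
        i=$((i + 1))
    done
    echo "$passed/$runs"
)

# Hypothetical usage on a compute node:
#   pass_rate 20 mpirun -n 4 ./bcast_test
```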

linux mpi intel-oneapi intel-mpi
1 Answer

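Since the backtrace contains shm/posix, I guessed that the problem is related to inter-process communication. I therefore ran the test many times with sbatch --array, trying different values of I_MPI_FABRICS on different numbers of nodes, and got the following results: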

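| MPI nodes | I_MPI_FABRICS | Pass rate | Note |
|-----------|---------------|-----------|------|
| 1 | shm:ofi | ~20% | default option |
| 1 | shm | 100% | |
| 1 | ofi | 100% | |
| 2 | shm:ofi | 100% | default option |
| 2 | shm | 100% | automatically falls back to shm:ofi |
| 2 | ofi | 100% | |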
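A sweep like this could be driven by a Slurm array job in which each task picks one fabric setting from its task index. This is a sketch, not the script actually used for the results above: the array range, the helper fabric_for_task, and the binary ./bcast_test are assumptions. SLURM_ARRAY_TASK_ID is the variable Slurm sets for each array task.

```shell
#!/bin/sh
#SBATCH --array=0-29   # hypothetical: 30 tasks, 10 per fabric setting
#SBATCH --nodes=1      # change to 2 for the two-node rows

# Map an array task index to one of the three fabric settings.
fabric_for_task() {
    case $(( $1 % 3 )) in
        0) echo "shm:ofi" ;;
        1) echo "shm" ;;
        2) echo "ofi" ;;
    esac
}

# Fall back to index 0 when the script runs outside of Slurm.
task_id=${SLURM_ARRAY_TASK_ID:-0}
I_MPI_FABRICS=$(fabric_for_task "$task_id")
export I_MPI_FABRICS
echo "task $task_id: I_MPI_FABRICS=$I_MPI_FABRICS"

# Hypothetical test binary; uncomment on the cluster:
# mpirun ./bcast_test
```

Each task's exit status can then be collected from the job accounting records to compute the pass rate per setting.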

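Although I still cannot identify the root cause, based on the test results the best workaround is to set I_MPI_FABRICS=shm, because it gives the best performance on a single node and falls back to ofi on multiple nodes.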
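In a job script, the workaround amounts to exporting the variable before launching the program. A minimal sketch follows; the process count and the application name ./my_app are hypothetical.

```shell
# Workaround: select the shared-memory fabric before calling mpirun.
export I_MPI_FABRICS=shm
echo "I_MPI_FABRICS=$I_MPI_FABRICS"

# Hypothetical launch line; uncomment on the cluster:
# mpirun -n 4 ./my_app
```

Setting the variable in the job script rather than in a site-wide profile keeps the change limited to the jobs affected by the crash.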
