After upgrading the compute nodes of our HPC cluster from CentOS 7 to a RHEL 8 Linux distribution (KeyarchOS 5.8), I received reports that some software compiled with Intel OneAPI 2021.1 no longer runs under `mpirun`. A typical error looks like this:

```
[cu345:1485183:0:1485183] Caught signal 8 (Floating point exception: integer divide by zero)
[cu345:1485184:0:1485184] Caught signal 8 (Floating point exception: integer divide by zero)
[cu345:1485185:0:1485185] Caught signal 8 (Floating point exception: integer divide by zero)
[cu345:1485186:0:1485186] Caught signal 8 (Floating point exception: integer divide by zero)
==== backtrace (tid:1485126) ====
0 0x0000000000012ce0 __funlockfile() :0
1 0x0000000000b696ed next_random() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_types.h:1809
2 0x0000000000b696ed impi_bcast_intra_huge() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_bcast.h:667
3 0x0000000000b6630d impi_bcast_intra_heap() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_bcast.h:798
4 0x000000000018ef6d MPIDI_POSIX_mpi_bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/intel/posix_coll.h:124
5 0x000000000017335e MPIDI_SHM_mpi_bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_coll.h:39
6 0x000000000017335e MPIDI_Bcast_intra_composition_alpha() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:303
7 0x000000000017335e MPID_Bcast_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1726
8 0x000000000017335e MPIDI_coll_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3356
9 0x0000000000153bee MPIDI_coll_select() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:129
10 0x000000000021c02d MPID_Bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:51
11 0x00000000001386e9 PMPI_Bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/bcast/bcast.c:416
```

When I tried to reproduce the problem, I found that a run only passed about 20% of the time, whereas on CentOS 7 it passed 100% of the time. I had no idea what was going on.
Since the backtrace contains `shm/posix`, I guessed the problem might be related to inter-process (shared-memory) communication. So I ran the test repeatedly with `sbatch --array` on different numbers of nodes, trying different `I_MPI_FABRICS` settings, and got the following results (a sketch of the test script is shown after the table):
| MPI nodes | I_MPI_FABRICS | Pass rate | Notes |
|---|---|---|---|
| 1 | shm:ofi | ~20% | default option |
| 1 | shm | 100% | |
| 1 | ofi | 100% | |
| 2 | shm:ofi | 100% | default option |
| 2 | shm | 100% | automatically falls back to shm:ofi |
| 2 | ofi | 100% | |
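
For reference, the pass-rate tests were driven by a Slurm array job roughly like the sketch below. The module name, task counts, and the `./a.out` binary are placeholders, not the exact script or application, and the fabric setting was swapped between `shm:ofi`, `shm`, and `ofi` across runs:

```bash
#!/bin/bash
#SBATCH --job-name=impi-fabrics-test
#SBATCH --nodes=1                  # repeated with --nodes=2 for the multi-node runs
#SBATCH --ntasks-per-node=32       # placeholder core count, adjust to the cluster
#SBATCH --array=1-50               # many repetitions to estimate the pass rate

# Placeholder environment setup; the actual module name on the cluster may differ.
module load intel/oneapi-2021.1

# Fabric under test; changed to shm or ofi for the other rows of the table.
export I_MPI_FABRICS=shm:ofi

# ./a.out stands in for the affected OneAPI-built application.
mpirun ./a.out \
  && echo "task ${SLURM_ARRAY_TASK_ID}: PASS" \
  || echo "task ${SLURM_ARRAY_TASK_ID}: FAIL"
```

The pass rate can then be estimated by counting PASS lines in the array job's output files, e.g. `grep -c PASS slurm-*.out`.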
Although I still haven't found the root cause, based on these test results the best workaround is to set `I_MPI_FABRICS=shm`, since it gives the best performance on a single node and falls back to ofi on multi-node runs.
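
In practice the workaround only requires exporting one environment variable before launching the application; a minimal sketch, assuming a bash job script and using `./a.out` as a placeholder for the affected binary:

```bash
# Workaround sketch: force the shared-memory transport.
# Per the tests above, multi-node jobs fall back to a network fabric,
# so exporting this unconditionally in the job script appears safe.
export I_MPI_FABRICS=shm
mpirun ./a.out
```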