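My goal is to run a Python script with the PBS qsub command. In the PBS script I allocate two nodes with "#PBS -l nodes=cu25:ppn=32+cu26:ppn=32", and I want the "subprocess.call(command)" call in the Python script to run across both nodes. However, I get the error "cu25.114979hfi_userinit: assign_context command failed: Device or resource busy cu25.114979hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)".
Here are the details of the PBS script, run.pbs: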
#!/bin/bash
#PBS -N test-cpu
#PBS -l nodes=cu25:ppn=32+cu26:ppn=32
#PBS -l walltime=720:00:00
#PBS -q batch
#PBS -j oe
cd $PBS_O_WORKDIR
NPROCS=`wc -l < $PBS_NODEFILE`
python3 run.py
My Python file, run.py:
import subprocess
command = "mpirun vasp_std"
with open('vasp.out', 'w') as f:
    subprocess.call(command, shell=True, stdout=f, cwd=None)
I submit the job with qsub run.pbs and expect the command "mpirun vasp_std" to run on the two allocated nodes. However, the following errors occur:
cu25.115070PSM2 can't open hfi unit: -1 (err=23)
cu26.99380PSM2 can't open hfi unit: -1 (err=23)
cu26.100027PSM2 can't open hfi unit: -1 (err=23)
cu25.114979PSM2 can't open hfi unit: -1 (err=23)
cu25.114926PSM2 can't open hfi unit: -1 (err=23)
cu25.115487PSM2 can't open hfi unit: -1 (err=23)
cu25.114932PSM2 can't open hfi unit: -1 (err=23)
[mpiexec@cu25] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@cu25] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:253): unable to write data to proxy
[mpiexec@cu25] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:175): unable to send signal downstream
[mpiexec@cu25] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@cu25] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:507): error waiting for event
[mpiexec@cu25] main (../../ui/mpich/mpiexec.c:1148): process manager error waiting for completion
[mpiexec@cu25] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
With only one node it works fine. How can I run the script on multiple nodes? Thanks in advance.
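One variant that might be worth trying (a sketch only, not a confirmed fix for the hfi errors above) is to build the mpirun command from $PBS_NODEFILE, so the launcher is told explicitly which hosts to use and how many ranks to start. It assumes the mpirun on this cluster is a Hydra-based launcher (the log shows mpiexec/HYDU messages) that accepts -machinefile and -np. The helper name build_mpirun_command is hypothetical.

```python
import os
import subprocess


def build_mpirun_command(nodefile, executable="vasp_std"):
    """Build an mpirun argument list from a PBS node file.

    Assumption: the node file has one line per allocated core, so the
    number of non-empty lines is the total number of MPI ranks.
    """
    with open(nodefile) as f:
        slots = [line.strip() for line in f if line.strip()]
    # Assumption: this mpirun accepts -machinefile and -np.
    return ["mpirun", "-machinefile", nodefile, "-np", str(len(slots)), executable]


def main():
    # PBS sets PBS_NODEFILE inside the job environment.
    command = build_mpirun_command(os.environ["PBS_NODEFILE"])
    with open("vasp.out", "w") as out:
        # Passing a list avoids shell=True; stderr goes to the same file.
        return subprocess.call(command, stdout=out, stderr=subprocess.STDOUT)


# Only launch when actually running inside a PBS job.
if __name__ == "__main__" and os.environ.get("PBS_NODEFILE"):
    raise SystemExit(main())
```

With the allocation in run.pbs, the node file would presumably contain 64 lines (32 for each of cu25 and cu26), so this would request 64 ranks. Whether that changes the "Device or resource busy" behaviour is untested.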