slurmd error: port already in use, so the worker nodes cannot communicate with slurmctld on the master


I am trying to set up a 3-node Slurm (version 22.05.8) cluster on machines with the following hostnames and local IP addresses:

  • server1 - 10.36.17.152
  • server2 - 10.36.17.166
  • server3 - 10.36.17.132

I pieced together a minimal working setup using these resources:

For a while everything looked fine, and I was able to run the command I normally use to check that everything is working:

srun --label --nodes=3 hostname

This used to produce the expected output showing the hostnames of all 3 machines, i.e. server1, server2 and server3.

But - without my having made any changes to the configuration - the command no longer works whenever I request more than 1 node. The behaviour is identical on all 3 machines; the output of sinfo is also included below:

root@server1:~# srun --nodes=1 hostname
server1
root@server1:~# 
root@server1:~# srun --nodes=3 hostname
srun: Required node not available (down, drained or reserved)
srun: job 312 queued and waiting for resources
^Csrun: Job allocation 312 has been revoked
srun: Force Terminated JobId=312
root@server1:~# 
root@server1:~# ssh server2 "srun --nodes=1 hostname"
server1
root@server1:~# 
root@server1:~# ssh server2 "srun --nodes=3 hostname"
srun: Required node not available (down, drained or reserved)
srun: job 314 queued and waiting for resources
^Croot@server1:~# 
root@server1:~# 
root@server1:~# sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
mainPartition*    up   infinite      2  down* server[2-3]
mainPartition*    up   infinite      1   idle server1
root@server1:~#

It turns out that slurmctld on the master node (hostname: server1) and slurmd on the worker nodes (hostnames: server2 and server3) are throwing errors that look network-related:

A few lines before and after the first occurrence of an error in slurmctld.log on the master node - this is the only type of error I noticed in the log (pastebin of the full log):

root@server1:/var/log# grep -B 20 -A 5 -m1 -i "error" slurmctld.log
[2024-07-26T13:13:49.579] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-07-26T13:13:49.580] debug:  power_save module disabled, SuspendTime < 0
[2024-07-26T13:13:49.580] Running as primary controller
[2024-07-26T13:13:49.580] debug:  No backup controllers, not launching heartbeat.
[2024-07-26T13:13:49.580] debug:  priority/basic: init: Priority BASIC plugin loaded
[2024-07-26T13:13:49.580] No parameter for mcs plugin, default values set
[2024-07-26T13:13:49.580] mcs: MCSParameters = (null). ondemand set.
[2024-07-26T13:13:49.580] debug:  mcs/none: init: mcs none plugin loaded
[2024-07-26T13:13:49.580] debug2: slurmctld listening on 0.0.0.0:6817
[2024-07-26T13:13:52.662] debug:  hash/k12: init: init: KangarooTwelve hash plugin loaded
[2024-07-26T13:13:52.662] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
[2024-07-26T13:13:52.662] debug:  gres/gpu: init: loaded
[2024-07-26T13:13:52.662] debug:  validate_node_specs: node server1 registered with 0 jobs
[2024-07-26T13:13:52.662] debug2: _slurm_rpc_node_registration complete for server1 usec=229
[2024-07-26T13:13:53.586] debug:  Spawning registration agent for server[2-3] 2 hosts
[2024-07-26T13:13:53.586] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2024-07-26T13:13:53.586] debug:  sched: Running job scheduler for default depth.
[2024-07-26T13:13:53.586] debug2: Spawning RPC agent for msg_type REQUEST_NODE_REGISTRATION_STATUS
[2024-07-26T13:13:53.587] debug2: Tree head got back 0 looking for 2
[2024-07-26T13:13:53.588] debug2: _slurm_connect: failed to connect to 10.36.17.166:6818: Connection refused
[2024-07-26T13:13:53.588] debug2: Error connecting slurm stream socket at 10.36.17.166:6818: Connection refused
[2024-07-26T13:13:53.588] debug2: _slurm_connect: failed to connect to 10.36.17.132:6818: Connection refused
[2024-07-26T13:13:53.588] debug2: Error connecting slurm stream socket at 10.36.17.132:6818: Connection refused
[2024-07-26T13:13:54.588] debug2: _slurm_connect: failed to connect to 10.36.17.166:6818: Connection refused
[2024-07-26T13:13:54.588] debug2: Error connecting slurm stream socket at 10.36.17.166:6818: Connection refused
[2024-07-26T13:13:54.589] debug2: _slurm_connect: failed to connect to 10.36.17.132:6818: Connection refused

The connections to 10.36.17.166:6818 and 10.36.17.132:6818 are being refused. That is the port specified by the SlurmdPort key in slurm.conf.
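
For what it is worth, a quick way to check what is going on with that port would be something like the following (a sketch; ss, pgrep and nc are assumed to be installed and are not mentioned anywhere else in this post):

# on server2 / server3: is anything already listening on the SlurmdPort?
sudo ss -tlnp | grep 6818
# is a slurmd process (or more than one) running?
pgrep -a slurmd
# from server1: can the controller reach the port at all?
nc -zv 10.36.17.166 6818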

There are similar errors in the slurmd.log files on both worker nodes:

slurmd.log on server2 - the errors only appear at the very end of the file (pastebin of the full log):

root@server2:/var/log# tail -5 slurmd.log
[2024-07-26T13:13:53.018] debug:  mpi/pmix_v4: init: PMIx plugin loaded
[2024-07-26T13:13:53.018] debug:  mpi/pmix_v4: init: PMIx plugin loaded
[2024-07-26T13:13:53.018] debug2: No mpi.conf file (/etc/slurm/mpi.conf)
[2024-07-26T13:13:53.018] error: Error binding slurm stream socket: Address already in use
[2024-07-26T13:13:53.018] error: Unable to bind listen port (6818): Address already in use
slurmd.log on server3 (pastebin of the full log):

root@server3:/var/log# tail -5 slurmd.log
[2024-07-26T13:13:53.383] debug:  mpi/pmix_v4: init: PMIx plugin loaded
[2024-07-26T13:13:53.383] debug:  mpi/pmix_v4: init: PMIx plugin loaded
[2024-07-26T13:13:53.383] debug2: No mpi.conf file (/etc/slurm/mpi.conf)
[2024-07-26T13:13:53.384] error: Error binding slurm stream socket: Address already in use
[2024-07-26T13:13:53.384] error: Unable to bind listen port (6818): Address already in use
Whenever I change anything in the configuration, I restart Slurm with the script below. Could the order in which these steps are executed be causing the problem I am seeing?

#! /bin/bash

scp /etc/slurm/slurm.conf /etc/slurm/gres.conf server2:/etc/slurm/ && echo copied slurm.conf and gres.conf to server2;
scp /etc/slurm/slurm.conf /etc/slurm/gres.conf server3:/etc/slurm/ && echo copied slurm.conf and gres.conf to server3;
echo

echo restarting slurmctld and slurmd on server1
(scontrol shutdown ; sleep 3 ; rm -f /var/log/slurmd.log /var/log/slurmctld.log ; slurmctld -d ; sleep 3 ; slurmd) && echo done

echo restarting slurmd on server2
(ssh server2 "rm -f /var/log/slurmd.log /var/log/slurmctld.log ; slurmd") && echo done

echo restarting slurmd on server3
(ssh server3 "rm -f /var/log/slurmd.log /var/log/slurmctld.log ; slurmd") && echo done
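
One thing I am unsure about is whether this script can end up launching a second slurmd while an old instance is still bound to port 6818. A guard along these lines (purely hypothetical, not part of the script above) would at least make that visible:

# hypothetical guard: only start slurmd if no instance is already running on that node
ssh server2 'pgrep -x slurmd >/dev/null && echo "slurmd already running on server2" || slurmd'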
The configuration files:

slurm.conf with the comments stripped:

root@server1:/etc/slurm# grep -v "#" slurm.conf
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug2
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
gres.conf:

root@server1:/etc/slurm# cat gres.conf
NodeName=server1 Name=gpu File=/dev/nvidia0
NodeName=server2 Name=gpu File=/dev/nvidia0
NodeName=server3 Name=gpu File=/dev/nvidia0
These configuration files are identical on all 3 machines.
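
To double-check that the copies really are in sync, and that the running controller actually picked up these ports, a quick comparison like this can help (a sketch; md5sum and scontrol are assumed to be available on all three machines):

# compare checksums of the config files across the nodes
for h in server1 server2 server3; do
  echo "== $h =="
  ssh "$h" 'md5sum /etc/slurm/slurm.conf /etc/slurm/gres.conf'
done

# confirm the ports the running daemons are actually using
scontrol show config | grep -Ei 'slurmctldport|slurmdport'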

As a complete beginner at Linux and Slurm administration, I have been struggling to make sense of even the most basic documentation, and I could not find an answer online. Any help would be greatly appreciated.

Thank you!

Tags: cluster-computing, slurm
1 Answer
I am running into the same problem. It has been about a week and I cannot figure it out. I have 6 Slurm nodes: 2 of them are picking up jobs, but 4 give the same error:

Running sudo slurmd -Dvvvv gives me the following errors:

slurmd: error: Error binding slurm stream socket: Address already in use
slurmd: fatal: Unable to bind listen port (6818): Address already in use

The slurmd service is running.

sudo lsof -i :6818
COMMAND     PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
slurmd  1312724 root    5u  IPv4 12836747      0t0  TCP *:6818 (LISTEN)

If I kill the process, the error goes away for a few minutes and the node shows up as idle. But a few minutes later the node goes down again.

I ran sudo killall slurmd with no success. All 4 nodes behave the same way, except for the controller and the first worker node in the cluster.
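
For reference, the kind of clean stop / verify / start sequence I am aiming for looks roughly like this (a sketch, assuming slurmd is managed as a systemd unit on these nodes, which may not match every setup):

sudo systemctl stop slurmd      # stop the managed service
sudo pkill -x slurmd            # make sure no stray slurmd process survives
sleep 2
sudo ss -tlnp | grep 6818 || echo "port 6818 is free"
sudo systemctl start slurmd     # start a single fresh instance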

I have since added a port range, thinking there was a port conflict:

scontrol show config | grep SrunPortRange
SrunPortRange = 60001-63000

So something is blocking the port or the process.

I tried all of the suggestions above, but nothing helped.
