I have a Raspberry Pi 5 cluster with one master node and one worker node. I successfully installed SLURM on the master node and am currently trying to configure the slurmd daemon to run on the worker node.
Problem
After configuring SLURM, I enabled and started the slurmd service on the master node with the following commands:
sudo systemctl enable slurmd
sudo systemctl start slurmd
sudo systemctl status slurmd
However, the slurmd service fails to start with the following error message:
× slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Sat 2024-10-26 23:03:46 CEST; 24min ago
Duration: 5ms
Docs: man:slurmd(8)
Process: 2026 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 2026 (code=exited, status=1/FAILURE)
CPU: 5ms
Oct 26 23:03:46 master systemd[1]: Started slurmd.service - Slurm node daemon.
Oct 26 23:03:46 master slurmd[2026]: slurmd: error: _parse_next_key: Parsing error at unrecognized key: TaskA>
Oct 26 23:03:46 master slurmd[2026]: slurmd: fatal: Could not open/read/parse cgroup.conf file /etc/slurm/cgr>
Oct 26 23:03:46 master slurmd[2026]: error: _parse_next_key: Parsing error at unrecognized key: TaskAffinity
Oct 26 23:03:46 master slurmd[2026]: fatal: Could not open/read/parse cgroup.conf file /etc/slurm/cgroup.conf
Oct 26 23:03:46 master systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Oct 26 23:03:46 master systemd[1]: slurmd.service: Failed with result 'exit-code'.
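In case it is useful, the untruncated log lines can be read from the journal, and slurmd can also be started in the foreground with extra verbosity to see the parse error directly (assuming the default unit name and install path):
sudo journalctl -u slurmd --no-pager -n 50
sudo /usr/sbin/slurmd -D -vvv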
My current cgroup.conf is as follows:
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
ConstrainCores=no
TaskAffinity=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30
Questions
How do I correct the errors in cgroup.conf that lead to the parsing errors mentioned in the logs?
Are there specific configurations required for SLURM to work correctly with Raspberry Pi 5 and its architecture?
What are the common causes for the high latency error reported in SLURM, and how can I address them?
Any guidance or suggestions would be greatly appreciated!
What I have tried so far:
Verified that MUNGE is running correctly (a full round-trip check is sketched below):
ssh pi@node01 munge -n
Checked the status of the slurmctld service on the master node, which is also reported as down.
Investigated the cgroup.conf for parsing errors.
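For the MUNGE check, the round-trip test from the MUNGE documentation encodes a credential on one node and decodes it on the other; a quick sketch, assuming pi@node01 is the worker node as in the command above:
munge -n | ssh pi@node01 unmunge   # encode on the master, decode on the worker
ssh pi@node01 munge -n | unmunge   # encode on the worker, decode on the master
# slurmctld on the master can be inspected the same way as slurmd:
sudo systemctl status slurmctld
sudo journalctl -u slurmctld --no-pager -n 50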
The error message is self-explanatory:
error: _parse_next_key: Parsing error at unrecognized key: TaskAffinity
TaskAffinity is not a valid configuration option for that file; see the cgroup.conf documentation for the list of valid options.
It was removed in Slurm version 21.08.2 and replaced by TaskPlugin=cgroup,affinity in slurm.conf.
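In practice that means deleting the TaskAffinity line from cgroup.conf and requesting affinity through the task plugin in slurm.conf instead. A minimal sketch of the two files after the change, keeping only a trimmed subset of the values from the question (the long-form plugin names task/cgroup,task/affinity are also accepted):
# /etc/slurm/cgroup.conf  (TaskAffinity line deleted)
CgroupMountpoint="/sys/fs/cgroup"
ConstrainCores=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
# /etc/slurm/slurm.conf  (task affinity is enabled here instead)
TaskPlugin=task/cgroup,task/affinity
After editing both files (on every node, since slurm.conf must be identical across the cluster), restart the daemons and check the status again:
sudo systemctl restart slurmctld
sudo systemctl restart slurmd
sudo systemctl status slurmd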