SLURM slurmd service fails to start on a Raspberry Pi 5 cluster due to a cgroup.conf parsing error


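I have a Raspberry Pi 5 cluster with one master node and one worker node. I installed SLURM successfully on the master node and am now trying to configure the slurmd daemon to run on the worker node.

Problem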

After configuring SLURM, I enabled and started the slurmd service on the master node with the following commands:

sudo systemctl enable slurmd
sudo systemctl start slurmd
sudo systemctl status slurmd

However, the slurmd service fails to start and reports the following error:

    × slurmd.service - Slurm node daemon
        Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; preset: enabled)
       Active: failed (Result: exit-code) since Sat 2024-10-26 23:03:46 CEST; 24min ago
     Duration: 5ms
       Docs: man:slurmd(8)
    Process: 2026 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 2026 (code=exited, status=1/FAILURE)
        CPU: 5ms

Oct 26 23:03:46 master systemd[1]: Started slurmd.service - Slurm node daemon.
Oct 26 23:03:46 master slurmd[2026]: slurmd: error: _parse_next_key: Parsing error at unrecognized key: TaskA>
Oct 26 23:03:46 master slurmd[2026]: slurmd: fatal: Could not open/read/parse cgroup.conf file /etc/slurm/cgr>
Oct 26 23:03:46 master slurmd[2026]: error: _parse_next_key: Parsing error at unrecognized key: TaskAffinity
Oct 26 23:03:46 master slurmd[2026]: fatal: Could not open/read/parse cgroup.conf file /etc/slurm/cgroup.conf
Oct 26 23:03:46 master systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Oct 26 23:03:46 master systemd[1]: slurmd.service: Failed with result 'exit-code'.

My current cgroup.conf is as follows:

CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
ConstrainCores=no
TaskAffinity=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

Questions

How do I correct the errors in cgroup.conf that lead to the parsing errors mentioned in the logs?
Are there specific configurations required for SLURM to work correctly with Raspberry Pi 5 and its architecture?
What are the common causes for the high latency error reported in SLURM, and how can I address them?

Any guidance or suggestions would be greatly appreciated!

Verified that Munge is working correctly:

ssh pi@node01 munge -n

Checked the status of the slurmctld service on the master node, which is also reported as down.
Investigated the cgroup.conf for parsing errors.
linux cluster-computing slurm raspberry-pi5 debian-bookworm
1 Answer

The error message is self-explanatory:

error: _parse_next_key: Parsing error at unrecognized key: TaskAffinity

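TaskAffinity is not a valid configuration option for this file. See the documentation for the cgroup configuration file for the list of valid options.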

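It was removed in Slurm version 21.08.2 and replaced with the following setting in slurm.conf (the Slurm documentation writes these plugin names as task/cgroup and task/affinity):

TaskPlugin=cgroup,affinity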
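
As a rough sketch of that fix, the obsolete key can be dropped from cgroup.conf with a small shell helper. The function name strip_task_affinity is hypothetical, and the snippet assumes GNU sed (as shipped with Debian), since it relies on `sed -i`. It only removes the TaskAffinity line and leaves every other key untouched.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): remove the TaskAffinity key
# from the cgroup.conf file given as the first argument.
strip_task_affinity() {
    conf="$1"
    # Keep a backup copy next to the original before editing it.
    cp "$conf" "$conf.bak" || return 1
    # Delete every line that assigns TaskAffinity; assumes GNU sed for -i.
    sed -i '/^[[:space:]]*TaskAffinity[[:space:]]*=/d' "$conf"
}
```

On the affected node, this would be run as root against /etc/slurm/cgroup.conf, followed by `sudo systemctl restart slurmd` and `sudo systemctl status slurmd` to see whether the parser still rejects anything in the file.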
