How to run NVSHMEM under Slurm

Question (votes: 0, answers: 1)

I am getting started with NVSHMEM and wanted to begin with a simple example, but without much success.

#include <nvshmem.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    // Initialize the NVSHMEM library
    nvshmem_init();

    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    fprintf(stdout, "PE %d of %d has started ...\n", mype, npes);

    // Finalize the NVSHMEM library
    nvshmem_finalize();

    return 0;
}
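For context on what "PE 0 of 1" typically indicates: plain `nvshmem_init()` relies on a bootstrap (PMI, MPI, etc.) to discover the other ranks, and when it cannot, each process initializes as its own single-PE job. One commonly documented alternative, assuming NVSHMEM was built with MPI support and an MPI module is loaded, is to initialize MPI first and hand its communicator to NVSHMEM via `nvshmemx_init_attr`. This is a sketch of that pattern, not a verified fix for this particular cluster:

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    // Let srun/mpirun set up the ranks through MPI first.
    MPI_Init(&argc, &argv);

    // Hand the MPI communicator to NVSHMEM so all ranks join
    // one NVSHMEM job instead of N independent single-PE jobs.
    nvshmemx_init_attr_t attr;
    MPI_Comm comm = MPI_COMM_WORLD;
    attr.mpi_comm = &comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    fprintf(stdout, "PE %d of %d has started ...\n", mype, npes);

    // Finalize NVSHMEM before MPI.
    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}
```

Compiling this variant additionally needs the MPI include path and `-lmpi` (or the `mpicc`/`mpicxx` wrapper passed to `-ccbin`).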

It is run with the following sbatch file:

#!/bin/bash -l
#SBATCH --nodes=2                          # number of nodes
#SBATCH --ntasks=8                         # number of tasks
#SBATCH --ntasks-per-node=4                # number of tasks per node
#SBATCH --gpus-per-task=1                  # number of gpu per task
#SBATCH --cpus-per-task=1                  # number of cores per task
#SBATCH --time=00:15:00                    # time (HH:MM:SS)
#SBATCH --partition=gpu                    # partition
#SBATCH --account=p200301                  # project account
#SBATCH --qos=default                      # SLURM qos

module load NCCL OpenMPI CUDA NVSHMEM \
  && nvcc -rdc=true -ccbin g++ -I $NVSHMEM_HOME/include test.cu -o test \
       -L $NVSHMEM_HOME/lib -lnvshmem_host -lnvshmem_device -lucs -lucp \
  && srun -n 8 ./test
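If the program is kept as-is with plain `nvshmem_init()`, NVSHMEM has to pick a bootstrap that matches how `srun` launches the tasks. A sketch of the relevant environment variables and launch flag follows; the exact values (PMI-2 vs. PMIx) depend on how Slurm and NVSHMEM were built at the site, so treat these as assumptions to check against `srun --mpi=list`:

```shell
# Select the PMI-based bootstrap and its flavor.
export NVSHMEM_BOOTSTRAP=PMI
export NVSHMEM_BOOTSTRAP_PMI=PMIX   # or PMI-2, depending on the site

# Launch with a matching PMI plugin so srun actually
# provides the rank/size information NVSHMEM expects.
srun --mpi=pmix -n 8 ./test
```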

The expected output would be something like:

PE 0 of 8 has started ...
PE 1 of 8 has started ...
PE 2 of 8 has started ...
.....

The output I actually get is:

PE 0 of 1 has started ...
PE 0 of 1 has started ...
PE 0 of 1 has started ...
PE 0 of 1 has started ...
PE 0 of 1 has started ...
PE 0 of 1 has started ...
PE 0 of 1 has started ...
PE 0 of 1 has started ...

I think I am missing something important but simple. Can someone enlighten me?
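One quick diagnostic that may help narrow this down: ask Slurm which PMI plugins it actually supports, since the bootstrap NVSHMEM expects has to be one of them.

```shell
# Print the MPI/PMI plugin types this Slurm installation supports
# (e.g. pmi2, pmix, none); the default used by plain `srun -n 8`
# may be `none`, which leaves NVSHMEM with nothing to bootstrap from.
srun --mpi=list
```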

nvidia hpc multi-gpu
1 Answer (votes: 0)

Did you ever figure this out? I am running into exactly the same problem.
