Python subprocess: how do I launch commands sequentially across multiple GPUs?

Question

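I have a Linux server with N GPUs, where N is a multiple of 2 (2, 4, or 8). I also have a Python file that I want to launch on a specific GPU by setting `CUDA_VISIBLE_DEVICES`. Normally I create several `tmux` sessions and launch the commands: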

# Each command below is run in a separate tmux session
CUDA_VISIBLE_DEVICES=0 python main.py --var1 1
CUDA_VISIBLE_DEVICES=1 python main.py --var1 2
CUDA_VISIBLE_DEVICES=2 python main.py --var1 3

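Managing the sessions is tedious, so I wrote a script that launches the commands for me, starting the next one as soon as a GPU becomes free: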

import os
import subprocess
import time

import torch  # only needed for the fallback when CUDA_VISIBLE_DEVICES is unset

commands = [
'python main.py --var1 1', 
'python main.py --var1 2', 
'python main.py --var1 3'
] 
# List of available GPUs
try:
    available_gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
    available_gpus = [str(gpu_id) for gpu_id in available_gpus]
except KeyError:
    # If the CUDA_VISIBLE_DEVICES environment variable is not set, use all GPUs
    available_gpus = [str(x) for x in range(torch.cuda.device_count())]

n_gpus = len(available_gpus)
procs_by_gpu = [None]*n_gpus

# Iterate over the commands and launch them in parallel
while len(commands) > 0:
    for idx, gpu_idx in enumerate(available_gpus):
        proc = procs_by_gpu[idx]
        if (proc is None) or (proc.poll() is not None):
            # Nothing is running on this GPU; launch a command.
            cmd = commands.pop(0)
            new_proc = subprocess.Popen(
                f'CUDA_VISIBLE_DEVICES={gpu_idx} {cmd}', shell=True)
            procs_by_gpu[idx] = new_proc
            break
    time.sleep(1)

# Wait for the last few tasks to finish before returning
for p in procs_by_gpu:
    if p is not None:
        p.wait()

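Explanation:

- The code first builds the list of usable GPUs.
- If a GPU is free, it runs the next command, taken with `commands.pop(0)`, on that GPU.
- If no GPU is free, it waits until one becomes free and then launches the next command on that GPU.
- The loop runs until no commands remain; the script then waits for the last processes to finish and exits.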

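My question: I want to extend this code so that each process can use two or more GPUs. The number would be controlled by an argument parser option `--gpus` (for example `--gpus 2`) and read in the code as `args.gpus`. With the `tmux` approach, it would look like this: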

# Each command below is run in a separate tmux session
CUDA_VISIBLE_DEVICES=0,1 python main.py --var1 1
CUDA_VISIBLE_DEVICES=2,3 python main.py --var1 2

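How should I extend the script above to handle multiple GPUs per command? The following cases can occur:

- With 4 available GPUs, `args.gpus=4`, and only one command, the code runs that command on all 4 GPUs and then exits.
- With 4 available GPUs, `args.gpus=4`, and two or more commands, the code runs the commands one after another on all 4 GPUs and then exits (equivalent to `CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --var1 1;CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --var1 2`).
- With 4 available GPUs, `args.gpus=2`, and only one command, the code runs that command on the first 2 available GPUs and then exits (equivalent to `CUDA_VISIBLE_DEVICES=0,1 python main.py --var1 1`).
- With 4 available GPUs, `args.gpus=2`, and two or more commands, the code runs the commands with two GPUs each, and starts the next command as soon as one of the running commands finishes. For example, commands #1 and #2 are running; when command #2 finishes, command #3 starts immediately on the GPUs that command #2 was using. This is equivalent to using two `tmux` sessions and specifying two GPUs in each command.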
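
To make the four cases concrete, here is a minimal sketch of the behaviour I am after. It is an illustration under stated assumptions, not a tested solution: `group_gpus` and `run_on_gpu_groups` are hypothetical helper names, the GPU ids are split into fixed, disjoint groups of `args.gpus` ids (any leftover GPUs that do not fill a group stay unused), and `CUDA_VISIBLE_DEVICES` is passed through the `env` argument of `subprocess.Popen` instead of being prefixed to the shell string.

```python
import os
import subprocess
import time


def group_gpus(available_gpus, gpus_per_command):
    """Split the GPU ids into disjoint groups of `gpus_per_command` ids each."""
    if gpus_per_command < 1 or gpus_per_command > len(available_gpus):
        raise ValueError("gpus_per_command must be between 1 and the number of GPUs")
    n_groups = len(available_gpus) // gpus_per_command
    return [
        available_gpus[i * gpus_per_command:(i + 1) * gpus_per_command]
        for i in range(n_groups)
    ]


def run_on_gpu_groups(commands, available_gpus, gpus_per_command, poll_interval=1.0):
    """Run every command on the first free GPU group.

    Returns the (command, CUDA_VISIBLE_DEVICES value) pairs in launch order.
    """
    groups = group_gpus(available_gpus, gpus_per_command)
    procs = [None] * len(groups)  # one slot per GPU group
    pending = list(commands)      # copy so the caller's list is not modified
    launched = []

    while pending:
        for idx, group in enumerate(groups):
            proc = procs[idx]
            if proc is None or proc.poll() is not None:
                # This group is idle: launch the next command on it.
                cmd = pending.pop(0)
                visible = ",".join(group)
                env = dict(os.environ, CUDA_VISIBLE_DEVICES=visible)
                procs[idx] = subprocess.Popen(cmd, shell=True, env=env)
                launched.append((cmd, visible))
                break
        else:
            # No group was free on this pass; wait before polling again.
            time.sleep(poll_interval)

    # Wait for the last few commands to finish before returning.
    for proc in procs:
        if proc is not None:
            proc.wait()
    return launched
```

With `available_gpus = ['0', '1', '2', '3']` and `gpus_per_command=args.gpus`, a value of 4 yields the single group `0,1,2,3`, so the commands run strictly one after another, while a value of 2 yields the groups `0,1` and `2,3`, so two commands run at a time and the next one starts on whichever pair frees up first.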
