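For context, I am writing a script that trains multiple PyTorch models. I have a training script that I want to run as a subprocess in a GNOME terminal, mainly so that I can keep an eye on training progress at any time. If I have multiple GPUs, I want to run the training script several times, each in its own window. To do this I have been using Popen. The following code opens a new terminal window and starts the training script: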
# Standard-library imports used by the launcher code
import copy
import shlex
import subprocess
import threading

# Create a list of commands
commands = []
kd_cfg = KDConfig.read(kd_cfg_path)
cmd = "python scripts/train_torch.py "
for specialist in kd_cfg.specialists:
    cmd += f"--config {kd_cfg.runtime_dict['specialists'][specialist]['config']} "
...
# Run each command in a new terminal and store the process object
num_gpus = len(gpus)
free_gpus = copy.deepcopy(gpus)
processes = []
worker_sema = threading.Semaphore(num_gpus)
commands_done = [False for _ in range(len(commands))]
# Start the watchdog
watch = threading.Thread(target=watch_dog, args=(processes, free_gpus, commands_done, worker_sema))
watch.start()
for cmd_idx, command in enumerate(commands):
    worker_sema.acquire()
    gpu = free_gpus.pop()
    command += f" --gpu {gpu}"  # allocate a free GPU from the list
    split_cmd_arr = shlex.split(command)
    proc = subprocess.Popen(['gnome-terminal', '--'] + split_cmd_arr)
    processes.append((cmd_idx, gpu, proc))
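The part I am struggling with is concurrency control. To protect the GPU resources I use a semaphore. My plan was to monitor the process that launches the GNOME terminal and, when it finishes, release the semaphore so that the next training run can start. Instead, all commands run at the same time. When I test with two commands and limit myself to one GPU, I still see two terminals open and two training runs start. In the watchdog thread code below, I see that both processes are zombies with no children, even though I can watch the training loop executing inside both terminals without crashing.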
# Check if processes are still running
while not all(commands_done):
    for cmd_idx, gpu, proc in processes:
        # try:
        # Check if process is still running
        ps_proc = psutil.Process(proc.pid)
        # Because we call bash/python right away, it executes as a child process
        ps_proc_children = get_child_processes(proc.pid)
        # is_running is a method, so it has to be called
        proc_has_running_children = any(child.is_running() for child in ps_proc_children)
        print(f"status: {ps_proc.status()}")
        print(f"children: {ps_proc_children}")
        if proc_has_running_children:
            print(f"Process {proc.pid} on GPU {gpu} is still running", end='\r')
        else:
            print(f"Process {proc.pid} has terminated")
            free_gpus.append(gpu)
            commands_done[cmd_idx] = True
            processes.remove((cmd_idx, gpu, proc))
            ps_proc.wait()
            print(f"removed proc {ps_proc.pid}")
            worker_sema.release()
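I thought the subprocess might basically launch another process and then return immediately, but I was surprised to find that there are no child processes either. If anyone has any insight, it would be much appreciated.
If it helps, here is some sample output from the watchdog.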
status: zombie
children: []
Process 4076 has terminated
removed proc 4076
status: zombie
children: []
Process 4133 has terminated
removed proc 4133
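One way to probe the "launches another process and returns immediately" idea is to ask the terminal launcher to stay alive until the command inside it exits. The sketch below only builds the argument list; it assumes the installed gnome-terminal supports the --wait option (not present in older releases), and build_terminal_argv and a.yaml are hypothetical names used for illustration.

```python
import shlex


def build_terminal_argv(command, gpu, wait=True):
    """Build the argv for launching `command` in a new GNOME terminal.

    Assumption: the local gnome-terminal accepts --wait, which keeps the
    launcher process alive until the command inside the window exits.
    """
    argv = ["gnome-terminal"]
    if wait:
        argv.append("--wait")
    argv.append("--")  # everything after this is the command to run
    argv += shlex.split(command) + ["--gpu", str(gpu)]
    return argv


if __name__ == "__main__":
    # a.yaml is a placeholder config name, not taken from the question
    print(build_terminal_argv("python scripts/train_torch.py --config a.yaml", 0))
```

If the launcher does stay alive, the Popen object returned for this argv could be checked with proc.poll() instead of inspecting child processes.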
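I think what is happening is that, in your watchdog thread, you modify the list processes while you are iterating over it. That does not reliably remove elements from the list. As a result, your watchdog thread checks the status of the same process several times and releases the semaphore several times.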
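To illustrate why mutating a list during iteration is unreliable, here is a minimal sketch with plain strings standing in for the (cmd_idx, gpu, proc) tuples. The function names are hypothetical. In this toy case the in-place version skips the element that follows each removal, while the copy-based version visits every entry.

```python
def remove_finished_in_place(items, finished):
    """Unreliable: removes from the same list the loop is walking."""
    visited = []
    for item in items:
        visited.append(item)
        if item in finished:
            items.remove(item)  # shifts later elements under the iterator
    return visited


def remove_finished_via_copy(items, finished):
    """Safer: iterate over a snapshot, mutate the original."""
    visited = []
    for item in items.copy():
        visited.append(item)
        if item in finished:
            items.remove(item)
    return visited


if __name__ == "__main__":
    done = {"p0", "p1", "p2"}
    procs = ["p0", "p1", "p2"]
    print(remove_finished_in_place(procs, done), procs)
    procs = ["p0", "p1", "p2"]
    print(remove_finished_via_copy(procs, done), procs)
```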
Try replacing
    for cmd_idx, gpu, proc in processes
with
    for cmd_idx, gpu, proc in processes.copy()
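Putting the suggestion together, here is a rough, self-contained sketch of a semaphore-gated launcher whose watchdog iterates over a snapshot and releases the semaphore exactly once per finished process. It is not the original script: run_commands is a hypothetical name, short Python one-liners stand in for the training runs, and completion is detected with Popen.poll() rather than psutil.

```python
import subprocess
import sys
import threading
import time


def run_commands(commands, gpus, poll_interval=0.05):
    """Run each argv list with a --gpu id, at most len(gpus) at a time."""
    free_gpus = list(gpus)
    lock = threading.Lock()                 # guards free_gpus and running
    slots = threading.Semaphore(len(gpus))  # one slot per GPU
    running = []                            # (cmd_idx, gpu, Popen) tuples
    exit_codes = {}

    def watch_dog():
        while len(exit_codes) < len(commands):
            with lock:
                snapshot = running.copy()   # never iterate the live list
            for entry in snapshot:
                cmd_idx, gpu, proc = entry
                code = proc.poll()          # None while still running
                if code is None:
                    continue
                with lock:
                    running.remove(entry)
                    free_gpus.append(gpu)   # return the GPU before the slot
                exit_codes[cmd_idx] = code
                slots.release()             # exactly once per process
            time.sleep(poll_interval)

    watcher = threading.Thread(target=watch_dog)
    watcher.start()
    for cmd_idx, argv in enumerate(commands):
        slots.acquire()                     # blocks until a GPU is free
        with lock:
            gpu = free_gpus.pop()
            proc = subprocess.Popen(argv + ["--gpu", str(gpu)])
            running.append((cmd_idx, gpu, proc))
    watcher.join()
    return exit_codes


if __name__ == "__main__":
    # Stand-in for a training run: sleep briefly, then exit with code 0
    fake_train = [sys.executable, "-c", "import time; time.sleep(0.2)"]
    print(run_commands([fake_train, fake_train], gpus=[0]))
```

Note that poll() only reflects the process that Popen started directly. If that process is a terminal launcher that exits early, the slot would be released while training is still running inside the window.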