Opening gnome-terminal from Python immediately shows up as a zombie

Question

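For background, I am writing a script to train multiple PyTorch models. I have a training script that I want to run as a subprocess in a gnome-terminal window, mainly so that I can keep an eye on training progress at any time. If I have multiple GPUs available, I want to run the training script several times, each in its own window. To do this I have been using Popen. The following code opens a new terminal window and launches the training script: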

#create a list of commands
commands = []
kd_cfg = KDConfig.read(kd_cfg_path)
cmd = "python scripts/train_torch.py "
for specialist in kd_cfg.specialists:
    cmd += f"--config {kd_cfg.runtime_dict['specialists'][specialist]['config']} "
    ...

# Run each command in a new terminal and store the process object
num_gpus = len(gpus)
free_gpus = copy.deepcopy(gpus)
processes = []
worker_sema = threading.Semaphore(num_gpus)
commands_done = [False for _ in range(len(commands))]

#start the watchdog
watch = threading.Thread(target=watch_dog, args=(processes,free_gpus,commands_done,worker_sema))
watch.start()

for cmd_idx, command in enumerate(commands):

    worker_sema.acquire()

    gpu = free_gpus.pop()
    command += f" --gpu {gpu}" #allocate a free GPU from the list
    split_cmd_arr = shlex.split(command)
    proc = subprocess.Popen(['gnome-terminal', '--'] + split_cmd_arr)

    processes.append( (cmd_idx,gpu,proc) )

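The part I am struggling with is concurrency control. To protect GPU resources I use a semaphore. My plan was to monitor the process that launches gnome-terminal and release the semaphore when it finishes, so that the next training run can start. Instead, all commands run at the same time. When I test with two commands and restrict them to a single GPU, I still see two terminals open and two training runs start. In the watchdog thread code below, I see that both processes are zombies with no children, even though I can watch the training loop executing in both terminals without crashing.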

    # Check if processes are still running
    while not all(commands_done):
        for cmd_idx, gpu, proc in processes:
            # try:
            # Check if process is still running
            ps_proc = psutil.Process(proc.pid)

            #BC we call bash python out of the gate it executes as a child proc
            ps_proc_children = get_child_processes(proc.pid)
            proc_has_running_children = any(child.is_running for child in ps_proc_children)

            print(f"status: {ps_proc.status()}")
            print(f"children:  {ps_proc_children}")
            if proc_has_running_children:
                print(f"Process {proc.pid} on GPU {gpu} is still running", end='\r')
            else:
                print(f"Process {proc.pid} has terminated")
                free_gpus.append(gpu)
                commands_done[cmd_idx] = True
                processes.remove((cmd_idx, gpu, proc))

                ps_proc.wait()
                print(f"removed proc {ps_proc.pid}")
                worker_sema.release()
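
The plan described above (one training run per GPU, with the next run starting as soon as a GPU frees up) can also be sketched without a separate watchdog. This is a minimal sketch, not the asker's code: `run_on_free_gpus` is a hypothetical helper, and the demo commands are short-lived Python processes standing in for the training script. It relies on the launched command blocking until the work is finished. For gnome-terminal that may require its `--wait` option, which is an assumption to verify with `gnome-terminal --help` on your version.

```python
import queue
import subprocess
import sys
import threading


def run_on_free_gpus(commands, gpus):
    """Run each argv list on a free GPU; at most len(gpus) run at once.

    Hypothetical helper for illustration. Returns a list of
    (gpu, returncode) tuples in the same order as `commands`.
    """
    # The queue acts as both the semaphore and the free-GPU list.
    free_gpus = queue.Queue()
    for gpu in gpus:
        free_gpus.put(gpu)

    results = [None] * len(commands)

    def worker(idx, argv):
        gpu = free_gpus.get()  # blocks until some GPU is free
        try:
            # subprocess.run blocks until the launched command exits.
            completed = subprocess.run(argv + ["--gpu", str(gpu)])
            results[idx] = (gpu, completed.returncode)
        finally:
            free_gpus.put(gpu)  # hand the GPU back, even on error

    threads = [
        threading.Thread(target=worker, args=(idx, argv))
        for idx, argv in enumerate(commands)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Demo: three stand-in commands sharing one GPU, so they run one at a time.
demo_commands = [[sys.executable, "-c", "pass"] for _ in range(3)]
results = run_on_free_gpus(demo_commands, gpus=[0])
print(results)
```

Because each worker only returns its GPU after `subprocess.run` comes back, a launcher that exits immediately would free the GPU too early, which matches the behaviour described in the question.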

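I thought that maybe the subprocess simply launches another process and returns immediately, but I was surprised to find that there are no child processes either. If anyone has any insight, it would be much appreciated.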

In case it helps, here is some sample output from the watchdog.

status: zombie
children:  []
Process 4076 has terminated
removed proc 4076
status: zombie
children:  []
Process 4133 has terminated
removed proc 4133
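
For context on the zombie status in this output: a child that has already exited but has not yet been reaped by its parent is reported as a zombie. gnome-terminal typically hands the command over to a separate gnome-terminal-server process and then exits, which would be consistent with seeing a zombie with no children. Treat that as an assumption to check on your system. The sketch below uses a short-lived Python child as a stand-in for such a launcher and shows `Popen.poll()` detecting the exit without psutil.

```python
import subprocess
import sys
import time

# Stand-in for a launcher that exits right away (not gnome-terminal itself).
proc = subprocess.Popen([sys.executable, "-c", "pass"])

# poll() returns None while the child is running. Once the child has
# exited, poll() reaps it and returns its exit code. Until it is reaped,
# tools such as psutil report the exited child as a zombie.
deadline = time.monotonic() + 10
exit_code = proc.poll()
while exit_code is None and time.monotonic() < deadline:
    time.sleep(0.05)
    exit_code = proc.poll()

print(f"launcher exit code: {exit_code}")
```

An exit code here only says that the launcher process finished. It says nothing about work that the launcher handed off to another process.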

Answer 1

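I think what is happening is that your watchdog thread modifies the processes list while iterating over it. This does not reliably remove elements from the list. As a result, your watchdog thread can end up checking the status of the same process more than once and releasing the semaphore more than once. Try replacing

for cmd_idx, gpu, proc in processes

with

for cmd_idx, gpu, proc in processes.copy()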
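
To illustrate why removing from a list during iteration is unreliable, here is a minimal sketch with hypothetical helper names (`drain_in_place`, `drain_from_copy`) and plain strings in place of the process tuples. Removing the current item shifts the remaining items left, so the loop steps over the next one. Iterating over a copy visits every entry.

```python
def drain_in_place(items):
    """Remove each visited item while iterating the same list (buggy)."""
    visited = []
    for item in items:
        visited.append(item)
        items.remove(item)  # shifts later items left under the iterator
    return visited


def drain_from_copy(items):
    """Iterate over a snapshot so removals do not disturb the loop."""
    visited = []
    for item in items.copy():
        visited.append(item)
        items.remove(item)
    return visited


buggy = ["a", "b", "c", "d"]
buggy_visited = drain_in_place(buggy)
print(buggy_visited, buggy)  # ['a', 'c'] ['b', 'd']

safe = ["a", "b", "c", "d"]
safe_visited = drain_from_copy(safe)
print(safe_visited, safe)  # ['a', 'b', 'c', 'd'] []
```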
