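I have set up a training loop in PyTorch. With `torchrun` (and `--ipc=host` for the Docker case), the training loop runs without any problems both on "bare metal" on an Ubuntu server and inside a Docker container. However, I need to run this training on DGX Cloud, and when I try to run the same training through the NGC CLI (using `ngc batch run`), it fails with a `RuntimeError`:

RuntimeError: The client socket has failed to connect to any network address of (5094242, 42185). The client socket has failed to connect to 0.77.187.98:42185 (errno: 110 - Connection timed out).

The full traceback is included at the bottom of the question.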
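One detail in the message may be worth noting. The following is a minimal sketch, assuming the first element of the tuple (5094242) is the master address that was handed to the rendezvous. Interpreting that integer as a 32-bit IPv4 address gives exactly the 0.77.187.98 that appears later in the same error, which would suggest the address was set to a plain number rather than a hostname or IP. The helper name `int_to_ipv4` is hypothetical.

```python
import ipaddress


def int_to_ipv4(value: int) -> str:
    """Interpret an integer as a 32-bit IPv4 address in dotted-quad form."""
    return str(ipaddress.IPv4Address(value))


# First element of the tuple in the error message
# (assumed here to be the master address seen by the rendezvous).
print(int_to_ipv4(5094242))  # prints 0.77.187.98
```

If that reading is right, the timeout is a consequence of connecting to a meaningless address, not of a blocked port.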
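What differences between the "local" node (whether running inside or outside a Docker container) and the DGX environment could cause this different behavior, and which one should I investigate first? Are there any known issues with DDP or `torchrun` related to the error I am seeing?

Any pointers are welcome, as I am not an expert in distributed training or inter-process communication.
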
Traceback (most recent call last):
  File "/workspace/./sushi/train.py", line 179, in main
    ddp_setup(backend=cfg.dist_backend)
  File "/workspace/./sushi/train.py", line 179, in main
    ddp_setup(backend=cfg.dist_backend)
  File "/workspace/./sushi/train.py", line 172, in ddp_setup
    distributed.init_process_group(backend=backend)
  File "/workspace/./sushi/train.py", line 172, in ddp_setup
    distributed.init_process_group(backend=backend)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 900, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 900, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 245, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 245, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 176, in _create_c10d_store
    return TCPStore(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 176, in _create_c10d_store
    return TCPStore(
RuntimeError: The client socket has failed to connect to any network address of (5094242, 42185). The client socket has failed to connect to 0.77.187.98:42185 (errno: 110 - Connection timed out).
RuntimeError: The client socket has failed to connect to any network address of (5094242, 42185). The client socket has failed to connect to 0.77.187.98:42185 (errno: 110 - Connection timed out).
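
The traceback ends in `_env_rendezvous_handler` creating a `TCPStore` from a master address and port, so a first check could be to look at what each process actually sees in its environment. This is a minimal sketch, assuming the launcher exports the usual `MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE` and `LOCAL_RANK` variables. `rendezvous_env` and `can_connect` are hypothetical helper names, not part of PyTorch.

```python
import os
import socket


def rendezvous_env() -> dict:
    """Collect the variables read by env:// rendezvous (None if unset)."""
    names = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK")
    return {name: os.environ.get(name) for name in names}


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    env = rendezvous_env()
    print(env)
    addr, port = env["MASTER_ADDR"], env["MASTER_PORT"]
    if addr and port:
        print(f"TCP connect to {addr}:{port}:", can_connect(addr, int(port)))
```

Running this in place of the training script under the same `ngc batch run` command should show whether `MASTER_ADDR` is a resolvable hostname or IP on DGX Cloud. The connection test only succeeds while something is listening on the master port, so a `False` result alone does not prove a network problem.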