PyTorch DDP TCP-related error when running on DGX Cloud


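I have set up a training loop in PyTorch and added support for distributed data parallel (DDP), following the fault-tolerant distributed training approach with torchrun. I have also Dockerized my training. The training loop runs without problems both on a "bare metal" Ubuntu server and inside a Docker container, as long as I specify --ipc=host.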
However, I need to run this training on DGX Cloud. When I try to run the same training through the NGC CLI (using ngc batch run), it fails with a RuntimeError:

RuntimeError: The client socket has failed to connect to any network address of (5094242, 42185). The client socket has failed to connect to 0.77.187.98:42185 (errno: 110 - Connection timed out).
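One detail in the message may be worth checking: the host is reported as the bare integer 5094242, and that integer read as a 32-bit IPv4 address is exactly 0.77.187.98, the address the socket then tries to reach. This suggests (as an assumption, not something the error confirms) that the master address handed to the rendezvous is a numeric value, for example a job identifier, rather than a hostname or a routable IP. The sketch below only demonstrates the integer-to-address conversion; the helper names are hypothetical.

```python
import ipaddress


def int_to_ipv4(value: int) -> str:
    """Render a 32-bit unsigned integer as a dotted-quad IPv4 address."""
    return str(ipaddress.IPv4Address(value))


def looks_like_bare_integer(host: str) -> bool:
    """Return True if a host string consists only of decimal digits."""
    return host.strip().isdigit()


if __name__ == "__main__":
    # Host value taken from the error message above.
    host = "5094242"
    if looks_like_bare_integer(host):
        print(f"{host} interpreted as IPv4: {int_to_ipv4(int(host))}")
```

If the master address on the DGX Cloud job really is a plain number, that would explain why the same code works locally, where the address is presumably localhost or a real hostname.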

The full traceback is included at the end of the question.

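Which differences between the environment on the "local" node (whether running inside or outside a Docker container) and the environment on DGX could cause this difference in behavior, and which one should I investigate first?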

Are there any known issues with DDP or torchrun related to the error I am seeing?

Any pointers are welcome, as I am not an expert in distributed training or inter-process communication.

Traceback (most recent call last):
  File "/workspace/./sushi/train.py", line 179, in main
    ddp_setup(backend=cfg.dist_backend)
  File "/workspace/./sushi/train.py", line 172, in ddp_setup
    distributed.init_process_group(backend=backend)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 900, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 245, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 176, in _create_c10d_store
    return TCPStore(
RuntimeError: The client socket has failed to connect to any network address of (5094242, 42185). The client socket has failed to connect to 0.77.187.98:42185 (errno: 110 - Connection timed out).
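The traceback goes through _env_rendezvous_handler, which builds the TCPStore from the MASTER_ADDR and MASTER_PORT environment variables (together with RANK and WORLD_SIZE). A reasonable first step is therefore to print those variables inside the DGX Cloud job and compare them with the local run. The following is a minimal diagnostic sketch, not part of the original training code; the function names are hypothetical, and LOCAL_RANK is included on the assumption that the job is launched with torchrun.

```python
import os
import socket

# Variables read by the env:// rendezvous, plus LOCAL_RANK set by torchrun.
RENDEZVOUS_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK")


def rendezvous_env(environ=os.environ) -> dict:
    """Collect the rendezvous-related variables (None if a variable is unset)."""
    return {name: environ.get(name) for name in RENDEZVOUS_VARS}


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Try a plain TCP connection to host:port and report whether it succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    env = rendezvous_env()
    for name, value in env.items():
        print(f"{name}={value!r}")

    addr, port = env["MASTER_ADDR"], env["MASTER_PORT"]
    if addr and port:
        print(f"TCP connect to {addr}:{port} ->", can_connect(addr, int(port)))
```

Running this just before init_process_group on every rank shows which address each process is actually trying to reach. The connection check only succeeds while something is already listening on the master port (normally rank 0 after it has created its store), so a failure on rank 0 itself is expected and not informative.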
	
sockets pytorch nvidia distributed-computing