我有两个 GCP 虚拟机。在两个虚拟机上,我运行 docker 容器。
我跑步
docker run --gpus all -it --rm --entrypoint /bin/bash -p 8000:8000 -p 7860:7860 -p 29500:29500 lf
我正在尝试llama工厂。
在一个容器中,我运行
FORCE_TORCHRUN=1 NNODES=2 RANK=1 MASTER_ADDR=34.138.7.129 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml
,
其中 34.138.7.129 是 vm 的外部 IP
在另一个容器中,我运行
FORCE_TORCHRUN=1 NNODES=2 RANK=0 MASTER_ADDR=34.138.7.129 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml
。
但我得到了
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank1]: Last error:
[rank1]: socketStartConnect: Connect to 172.17.0.2<49113> failed : Software caused connection abort
E0924 21:26:39.866000 140711615779968 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 484) of binary: /usr/bin/python3.10
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/workspace/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-24_21:26:39
host : 71af1f49abe3
rank : 1 (local_rank: 0)
exitcode : 1 (pid: 484)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
看来pytorch正在使用docker容器ip而不是gcp vm外部ip。
如何解决这个问题?
检查 docker 容器中的
ip addr
:
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 22:23:6b:28:6b:e0 brd ff:ff:ff:ff:ff:ff
inet 172.17.42.1/16 scope global docker0
inet6 fe80::a402:65ff:fe86:bba6/64 scope link
valid_lft forever preferred_lft forever
将该 IP 用于您的 Docker 容器。