I am running this code (github_code) on a machine with an RTX 3090 Ti. The code raises an error at the first forward layer, although it runs successfully on the CPU. Stack trace:
Traceback (most recent call last):
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
cli.main()
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
run()
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
runpy.run_path(target, run_name="__main__")
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 322, in run_path
pkg_name=pkg_name, script_name=fname)
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 136, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
exec(code, run_globals)
File "/home/tekre/Desktop/video_captioning_studies/HMN/main.py", line 37, in <module>
model = train_fn(cfgs, cfgs.model_name, model, hungary_matcher, train_loader, valid_loader, device)
File "/home/tekre/Desktop/video_captioning_studies/HMN/train.py", line 66, in train_fn
preds, objects_pending, action_pending, video_pending = model(objects, object_masks, feature2ds, feature3ds, numberic_caps)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/tekre/Desktop/video_captioning_studies/HMN/models/caption_models/hierarchical_model.py", line 95, in forward
objects_feats, action_feats, video_feats, objects_semantics, action_semantics, video_semantics = self.forward_encoder(objects_feats, objects_mask, feature2ds, feature3ds)
File "/home/tekre/Desktop/video_captioning_studies/HMN/models/caption_models/hierarchical_model.py", line 57, in forward_encoder
objects_feats, objects_semantics = self.entity_level(feature2ds, feature3ds, objects, objects_mask)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/tekre/Desktop/video_captioning_studies/HMN/models/encoders/entity_level.py", line 53, in forward
features_2d = self.feature2d_proj(features_2d.view(-1, features_2d.shape[-1]))
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 107, in forward
exponential_average_factor, self.eps)
File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/functional.py", line 1670, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
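I set up my environment by following the instructions in the GitHub repository. Do I need to install a cuDNN package separately, or does PyTorch take care of it inside the environment? I am posting the question here because it did not get much response there.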
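In my case, this error was caused by a mismatch between the CUDA version I was using (11.7) and the CUDA version that the installed PyTorch build was compiled against.
After installing the correct build from the PyTorch installation page, I was able to run the code.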
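A quick way to check for this kind of mismatch is to print what the installed PyTorch build was compiled for and compare it with the GPU's compute capability. The snippet below is a minimal sketch, not part of the original repository: `parse_arch`, `capability_supported`, and `report` are hypothetical helper names, and `torch.cuda.get_arch_list` is looked up with `getattr` on the assumption that older PyTorch releases may not expose it.

```python
import re


def parse_arch(arch):
    """Split a string such as 'sm_86' or 'compute_80' into (kind, major, minor)."""
    match = re.match(r"^(sm|compute)_(\d+)(\d)[a-z]*$", arch)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


def capability_supported(capability, arch_list):
    """Return True if a build with `arch_list` can run on a GPU of `capability`.

    Binary code (sm_XY) runs on GPUs with the same major version and an equal
    or higher minor version. PTX (compute_XY) can be JIT-compiled for any GPU
    with an equal or higher capability.
    """
    major, minor = capability
    for arch in arch_list:
        parsed = parse_arch(arch)
        if parsed is None:
            continue  # ignore entries in an unexpected format
        kind, a_major, a_minor = parsed
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


def report():
    """Print the versions that matter when diagnosing a CUDA/cuDNN mismatch."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    print("PyTorch version:", torch.__version__)
    print("Built against CUDA:", torch.version.cuda)  # None for CPU-only builds
    print("Bundled cuDNN:", torch.backends.cudnn.version())
    if not torch.cuda.is_available():
        print("CUDA is not available to this PyTorch build.")
        return
    capability = torch.cuda.get_device_capability(0)
    print("GPU:", torch.cuda.get_device_name(0), "capability:", capability)
    # Assumption: get_arch_list may be missing on old releases, so look it up safely.
    get_arch_list = getattr(torch.cuda, "get_arch_list", None)
    if get_arch_list is None:
        print("This PyTorch version does not expose its compiled architecture list.")
    else:
        arch_list = get_arch_list()
        print("Compiled architectures:", arch_list)
        print("GPU supported by this build:", capability_supported(capability, arch_list))


if __name__ == "__main__":
    report()
```

An RTX 3090 Ti reports compute capability 8.6, so a build whose architecture list stops at `sm_75` would be a likely cause of the error above. The NVIDIA driver's own CUDA version (shown by `nvidia-smi`) only needs to be new enough for the CUDA runtime that PyTorch ships with.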
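In my case, the GPU ran out of memory when execution reached the loss.backward() call. I was running an LSTM architecture. As far as I know, PyTorch does not currently report this situation as a distinct out-of-memory error. I checked GPU usage with nvidia-smi and noticed a spike just before the RuntimeError.
To debug the issue further, see the discussion here.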
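To confirm a memory spike like this without watching nvidia-smi by hand, the same numbers can be logged from inside the training script. The following is a minimal sketch that shells out to `nvidia-smi` with its CSV query options; the helper names `parse_memory_csv`, `gpu_memory_mib`, and `log_gpu_memory` are hypothetical and not part of PyTorch or the repository in question.

```python
import shutil
import subprocess

# Ask nvidia-smi for used and total memory per GPU, in MiB, as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Turn lines such as '10240, 24564' into a list of (used, total) tuples."""
    rows = []
    for line in text.strip().splitlines():
        try:
            used, total = (int(field.strip()) for field in line.split(","))
        except ValueError:
            continue  # skip rows that are not two integers (e.g. '[N/A]')
        rows.append((used, total))
    return rows


def gpu_memory_mib():
    """Return (used, total) MiB per GPU, or an empty list if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


def log_gpu_memory(tag):
    """Print one line per GPU, labelled with `tag`, showing memory in use."""
    for index, (used, total) in enumerate(gpu_memory_mib()):
        percent = 100 * used / total
        print(f"[{tag}] GPU {index}: {used} / {total} MiB ({percent:.0f}%)")
```

Calling `log_gpu_memory("before backward")` immediately before `loss.backward()` and again in an `except RuntimeError` handler should show whether usage is close to the card's limit when the error fires. The figure is coarse: nvidia-smi counts everything the process has reserved, including memory held by PyTorch's caching allocator.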