What do these TORCH_USE_CUDA_DSA and frozen_modules errors mean, and how do I fix them?

Question

I am trying to run a Mask R-CNN model on aerial imagery. To speed this up, I run everything on CUDA, but that produces some errors. Here is my code:

# Python
import torch
import torchvision
from torchvision.models.detection import MaskRCNN
import gc
import os
import torch.nn as nn
from torchvision.models.detection.rpn import AnchorGenerator
from torch.cuda.amp import GradScaler
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

gc.collect()

torch.cuda.empty_cache()

# Define the model
resnet_net = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)

modules = list(resnet_net.children())[:-1]
backbone = nn.Sequential(*modules)
backbone.out_channels = 512


# Define the anchor generator
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

# Define the model with the configured backbone and anchor generator
model = MaskRCNN(backbone=backbone, num_classes=91, rpn_anchor_generator=anchor_generator)

# Move the model to the GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)
scaler = GradScaler()

# Train the model
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    counter = 0
    for images, height, targets, names in train_ds:
        print(counter)
        counter += 1

        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        optimizer.zero_grad()

        with torch.cuda.amp.autocast():
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
        
        scaler.scale(losses).backward()
        scaler.step(optimizer)
        scaler.update()

When I run this code on the GPU, I sometimes get this error:

RuntimeError: CUDA error: an illegal memory access was encountered Compile with "TORCH_USE_CUDA_DSA" to enable device-side assertions.

When I run it on the CPU, I get this error:

[error] Disposing session as kernel process died ExitCode: 3221225477, Reason: 0.00s - Debugger warning: It seems that frozen modules are being used, which may 0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off 0.00s - to python to disable frozen modules. 0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.

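I previously ran into some CUDA memory problems with this code, which seem related. What are these frozen modules, and is it safe to turn them off? I also tried to enable TORCH_USE_CUDA_DSA in my code by adding the following: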

os.environ["TORCH_USE_CUDA_DSA"] = "1"

But that did not solve the problem. I also had one run that hit none of these problems, where the code ran smoothly (on the GPU).

Answer 1

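For background on device-side assertions and errors, see the question "What does 'RuntimeError: CUDA error: device-side assert triggered' in PyTorch mean?". The real error here appears to be the illegal memory access, which can sometimes happen when the GPU runs out of CUDA memory.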
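
A minimal sketch of the debugging setup the error message points toward, under two assumptions worth checking against your own install. First, `CUDA_LAUNCH_BLOCKING` is read when CUDA is initialized, so it is safest to set it before importing torch. Second, `TORCH_USE_CUDA_DSA` is a build-time option (the message says "Compile with"), so setting it as an environment variable for a prebuilt PyTorch would not be expected to change anything. The helper name `enable_sync_cuda_errors` is made up for illustration.

```python
import os


def enable_sync_cuda_errors(env=None):
    """Make CUDA kernel launches synchronous so errors surface at the failing call.

    Hypothetical helper: call it before the first CUDA call in the process
    (safest: before `import torch`).
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Assumption: TORCH_USE_CUDA_DSA only matters when PyTorch itself is
    # compiled, so it is deliberately not set here.
    return env["CUDA_LAUNCH_BLOCKING"]


enable_sync_cuda_errors()
# import torch  # import torch only after the variable has been set
```

With synchronous launches, the Python traceback should point at the operation that actually failed rather than at a later, unrelated call.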

The CPU case may also be failing because of insufficient RAM. In this issue, a user fixed the problem by running on a system with more RAM: https://github.com/microsoft/vscode-jupyter/issues/13678
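
To read the CPU-side message, it may help to decode the exit code. The sketch below assumes the kernel ran on Windows, where 3221225477 corresponds to the NTSTATUS value 0xC0000005 (access violation). Under that assumption the process crashed in native code, and the frozen-modules text is only a debugger warning printed alongside the crash rather than its cause. The function name `describe_exit_code` and the small lookup table are illustrative.

```python
# Illustrative lookup of a few Windows NTSTATUS codes (not exhaustive).
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000017: "STATUS_NO_MEMORY",
}


def describe_exit_code(code):
    """Return the hex form of an exit code and its NTSTATUS name if known."""
    unsigned = code & 0xFFFFFFFF  # normalize negative 32-bit values
    name = KNOWN_NTSTATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08X} ({name})"


print(describe_exit_code(3221225477))  # 0xC0000005 (STATUS_ACCESS_VIOLATION)
```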

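I would try running the code above with a smaller model to see whether it produces a different kind of error. Alternatively, if this is in Colab, try increasing the available CPU/GPU memory.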
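
One way to check whether memory pressure is involved is to log GPU memory between iterations. This is a rough sketch, assuming a PyTorch version where `torch.cuda.mem_get_info` is available. The helper name `cuda_memory_report` is hypothetical, and it returns `None` when no GPU is present.

```python
import torch


def cuda_memory_report(device=None):
    """Return current CUDA memory figures in MiB, or None without a GPU."""
    if not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    mib = 2 ** 20
    return {
        # Memory currently held by live tensors.
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        # Highest tensor memory seen so far in this process.
        "peak_allocated_mib": torch.cuda.max_memory_allocated(device) / mib,
        # Free and total memory as reported by the driver for the device.
        "free_mib": free_bytes / mib,
        "total_mib": total_bytes / mib,
    }


print(cuda_memory_report())
```

Calling it next to `print(counter)` in the training loop would show whether allocated memory keeps growing before the crash.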
