What do these TORCH_USE_CUDA_DSA and frozen modules errors mean, and how do I fix them?

Problem description

I am trying to run a Mask R-CNN model on aerial imagery. To speed this up, I run everything on CUDA, but that produces some errors. Here is my code:

# Python
import os  # needed for os.environ below
import gc

import torch
import torch.nn as nn
import torchvision
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torch.cuda.amp import GradScaler

# Make CUDA kernel launches synchronous so errors are reported at the failing call
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

gc.collect()

torch.cuda.empty_cache()

# Define the model
resnet_net = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)

modules = list(resnet_net.children())[:-1]
backbone = nn.Sequential(*modules)
backbone.out_channels = 512


# Define the anchor generator
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

# Define the model with the configured backbone and anchor generator
model = MaskRCNN(backbone=backbone, num_classes=91, rpn_anchor_generator=anchor_generator)

# Move the model to the GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)
scaler = GradScaler()

# Train the model
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    counter = 0
    for images, height, targets, names in train_ds:
        print(counter)
        counter += 1

        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        optimizer.zero_grad()

        with torch.cuda.amp.autocast():
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
        
        scaler.scale(losses).backward()
        scaler.step(optimizer)
        scaler.update()

If I run this code on the GPU, I sometimes get this error:

RuntimeError: CUDA error: an illegal memory access was encountered Compile with "TORCH_USE_CUDA_DSA" to enable device-side assertions.

If I run it on the CPU, I get this error:

[error] Disposing session as kernel process died ExitCode: 3221225477, Reason:
0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.

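I have run into some CUDA memory problems with this code before, which seems related. What are these frozen modules, and is it safe to turn them off? I also tried to enable TORCH_USE_CUDA_DSA in my code by adding the following: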

os.environ["TORCH_USE_CUDA_DSA"] = "1"

But that did not fix the problem. Also, on one run I hit none of these problems and the code ran smoothly (on the GPU).

python memory pytorch torchvision mask-rcnn
1 Answer

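For background on device-side assertions, see: What does "RuntimeError: CUDA error: device-side assert triggered" in PyTorch mean? The real error here appears to be the illegal memory access, which can sometimes occur when the GPU runs out of CUDA memory.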
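Note that the message says "Compile with": TORCH_USE_CUDA_DSA is a build-time option for PyTorch itself, so setting it through os.environ at runtime is not expected to change anything. Apart from memory pressure, errors of this kind in detection models are often traced back to bad targets, such as a label outside the range [0, num_classes) or a box with zero width or height. Whether that applies to this dataset is only an assumption. A minimal sketch of a pre-flight check, where find_target_problems is a hypothetical helper (not part of torchvision) that works on plain Python lists, for example obtained with t["boxes"].tolist() and t["labels"].tolist():

```python
def find_target_problems(target, num_classes):
    """Return a list of human-readable problems found in one target dict.

    Hypothetical helper: `target` is assumed to hold plain Python lists,
    with boxes given as [x1, y1, x2, y2].
    """
    problems = []
    boxes = target["boxes"]
    labels = target["labels"]

    # Every box needs exactly one label.
    if len(boxes) != len(labels):
        problems.append(f"{len(boxes)} boxes but {len(labels)} labels")

    # Labels index into the classifier output, so they must be in range.
    for i, label in enumerate(labels):
        if not 0 <= label < num_classes:
            problems.append(
                f"label {label} at index {i} is outside [0, {num_classes})"
            )

    # Degenerate boxes have no area.
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if x2 <= x1 or y2 <= y1:
            problems.append(
                f"box {i} has non-positive width or height: {[x1, y1, x2, y2]}"
            )
    return problems


good = {"boxes": [[10.0, 20.0, 50.0, 80.0]], "labels": [3]}
bad = {"boxes": [[10.0, 20.0, 10.0, 80.0]], "labels": [91]}

print(find_target_problems(good, num_classes=91))
print(find_target_problems(bad, num_classes=91))
```

Running a check like this over the whole dataset on the CPU before training can help separate data problems from genuine out-of-memory conditions.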

The CPU case may also be failing because of insufficient RAM. In the issue linked here, a user fixed the problem by running on a system with more RAM - https://github.com/microsoft/vscode-jupyter/issues/13678
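As for the frozen modules text, it is a warning printed by the debugger about possibly missed breakpoints, and it says itself that debugging will proceed, so it is unlikely to be the cause of the crash. The more informative part is the exit code. A small sketch that decodes it, assuming the kernel ran on Windows (which the value suggests); the lookup table is illustrative and lists only two well-known status codes:

```python
# Exit code reported in the "kernel process died" message.
exit_code = 3221225477

# Windows reports crashes as unsigned 32-bit NTSTATUS values.
as_hex = f"0x{exit_code:08X}"

# Illustrative, incomplete table of well-known NTSTATUS codes.
known_status = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}

status_name = known_status.get(exit_code, "unknown")
print(as_hex, status_name)
```

An access violation means the process touched memory it was not allowed to, which points to a crash in native code rather than to the frozen modules setting.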

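I would try running the code above with a smaller model and see whether it produces a different kind of error. Alternatively, if this is running in Colab, try increasing the available CPU/GPU memory.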
