I am trying to run a Mask R-CNN model on aerial imagery. To speed things up, I run everything on CUDA, but that produces some errors. Here is my code:
# Python
import gc
import os  # needed for os.environ below

import torch
import torch.nn as nn
import torchvision
from torch.cuda.amp import GradScaler
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# Make CUDA kernel launches synchronous so errors point at the failing call
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
gc.collect()
torch.cuda.empty_cache()
# Define the model
resnet_net = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
modules = list(resnet_net.children())[:-1]
backbone = nn.Sequential(*modules)
backbone.out_channels = 512
# Define the anchor generator
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
aspect_ratios=((0.5, 1.0, 2.0),))
# Define the model with the configured backbone and anchor generator
model = MaskRCNN(backbone=backbone, num_classes=91, rpn_anchor_generator=anchor_generator)
# Move the model to the GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)
scaler = GradScaler()
# Train the model
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    counter = 0
    for images, height, targets, names in train_ds:
        print(counter)
        counter += 1
        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
        scaler.scale(losses).backward()
        scaler.step(optimizer)
        scaler.update()
When I run this code on the GPU, I sometimes get this error:
RuntimeError: CUDA error: an illegal memory access was encountered Compile with "TORCH_USE_CUDA_DSA" to enable device-side assertions.
When I run it on the CPU, I get this error:
[error] Disposing session as kernel process died ExitCode: 3221225477, Reason: 0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
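I have run into CUDA memory problems with this code before, and this seems related. What are these frozen modules, and is it safe to turn them off? I also tried to enable TORCH_USE_CUDA_DSA in my code by adding the following: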
os.environ["TORCH_USE_CUDA_DSA"] = "1"
But that did not solve the problem. I also had one run (on the GPU) in which none of these problems appeared and the code ran smoothly.
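Here is a link about device-side assertions and this error: What does "RuntimeError: CUDA error: device-side assert triggered" in PyTorch mean? It looks like the real error here is the illegal memory access, which can sometimes happen when the GPU runs out of CUDA memory.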
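Because the message says to compile with TORCH_USE_CUDA_DSA, it appears to be a build-time option of PyTorch itself, so setting it through os.environ at run time would not be expected to change anything. A run-time setting that can help localize the failing call is CUDA_LAUNCH_BLOCKING, which the question's code already uses. Below is a minimal sketch, assuming the variable has to be in place before CUDA is first initialized in the process; the helper name enable_blocking_launches is hypothetical.

```python
import os


def enable_blocking_launches(env=None):
    """Ask CUDA to run kernels synchronously so errors surface at the failing call.

    Assumption: this runs before the first CUDA call in the process
    (safest: before `import torch`); set later, it may have no effect.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


enable_blocking_launches()
# import torch  # import torch (and touch the GPU) only after the line above
```

With synchronous launches the traceback should point at the operation that actually failed, rather than at a later, unrelated call.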
In the CPU case, this may also be failing because of insufficient RAM. In this link, a user fixed the issue by running on a system with more RAM: https://github.com/microsoft/vscode-jupyter/issues/13678.
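The exit code in the kernel log can be decoded to see what kind of crash it was. A small sketch, assuming the kernel was running on Windows, where exit codes of this size are NTSTATUS values: 3221225477 is 0xC0000005, the access-violation status. That means the process died on an invalid memory access in native code, which is consistent with a memory problem but does not prove that RAM ran out. The lookup table below is hypothetical and holds only this one entry.

```python
# Hypothetical lookup table containing only the status relevant here.
KNOWN_NTSTATUS = {"0xC0000005": "STATUS_ACCESS_VIOLATION"}


def describe_exit_code(code):
    """Render a process exit code as an unsigned 32-bit hex string."""
    return f"0x{code & 0xFFFFFFFF:08X}"


hex_code = describe_exit_code(3221225477)
print(hex_code, KNOWN_NTSTATUS.get(hex_code, "unknown"))
```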
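I would try running the code above with a smaller model and see whether that produces a different kind of error. Alternatively, if this is running in Colab, try increasing the available CPU/GPU memory.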
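To judge whether memory really is the limit while experimenting with model size, it can help to log GPU memory during training. A minimal sketch, assuming a single GPU at index 0; the function name gpu_memory_summary and the dictionary keys are made up for illustration, and the function returns None when PyTorch or CUDA is not available.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def gpu_memory_summary(device_index=0):
    """Return current GPU memory figures in MiB, or None if CUDA is unavailable."""
    if torch is None or not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    total = torch.cuda.get_device_properties(device_index).total_memory
    return {
        "allocated_mib": torch.cuda.memory_allocated(device_index) / mib,
        "reserved_mib": torch.cuda.memory_reserved(device_index) / mib,
        "peak_allocated_mib": torch.cuda.max_memory_allocated(device_index) / mib,
        "total_mib": total / mib,
    }


print(gpu_memory_summary())
```

Calling this once per iteration, next to the existing print(counter), would show whether the peak allocation climbs toward the total just before a crash.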