I'm hitting a CUDA out-of-memory error while training a ResNet-18 model on 128x128 images with an RTX 4060 GPU. Even after reducing the batch size, the problem persists. Here are the details of my setup and code:

Setup:
GPU: NVIDIA RTX 4060 with 8GB VRAM
Framework: PyTorch
Model: ResNet-18 (pretrained)
Image size: 128x128
Batch size: 4 (initially tried 8, 16, 32)
Accumulation steps: 8
Optimizer: Adam
Loss function: CrossEntropyLoss
OS: Windows 11
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import GradScaler, autocast

# Define transformations
transform = transforms.Compose([
    transforms.Resize((128, 128)),  # Reduce image size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load dataset
data_dir = 'OSA_16_07/OSA'
dataset = torchvision.datasets.ImageFolder(root=data_dir, transform=transform)

# Split the dataset into training and validation sets
train_size = int(0.8 * len(dataset))  # 80% for training
val_size = len(dataset) - train_size  # 20% for validation
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# Create data loaders with smaller batch size
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)  # Reduced batch size
val_loader = DataLoader(val_dataset, batch_size=4, shuffle=False)  # Reduced batch size

# Load pre-trained ResNet-18 model
model = torchvision.models.resnet18(pretrained=True)

# Replace the final fully connected layer
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 11)

# Move the model to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Initialize GradScaler for mixed precision training
scaler = GradScaler()

# Training loop with gradient accumulation
num_epochs = 10
accumulation_steps = 8  # Number of steps to accumulate gradients

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    optimizer.zero_grad()
    for i, (inputs, labels) in enumerate(train_loader):
        inputs, labels = inputs.to(device), labels.to(device)
        with autocast():
            outputs = model(inputs)
            loss = criterion(outputs, labels) / accumulation_steps
        scaler.scale(loss).backward()
        if (i + 1) % accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
        running_loss += loss.item() * accumulation_steps
    print(f"Epoch {epoch+1}/{num_epochs}, Training Loss: {running_loss/len(train_loader)}")

    # Validation step
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            with autocast():
                outputs = model(inputs)
                loss = criterion(outputs, labels)
            val_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    val_accuracy = 100 * correct / total
    print(f"Validation Loss: {val_loss/len(val_loader)}, Validation Accuracy: {val_accuracy}%")

    # Clear GPU cache
    torch.cuda.empty_cache()
Steps taken:
Reduced the batch size to 4.
Called torch.cuda.empty_cache() after each epoch to clear the GPU cache.
Verified with nvidia-smi that no other processes are consuming GPU memory.
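Beyond the steps above, the caching allocator itself can be tuned: on recent PyTorch releases the `PYTORCH_CUDA_ALLOC_CONF` environment variable can reduce fragmentation-related OOMs. A sketch (the script name `train.py` is a placeholder):

```shell
# Set before launching the script; on Windows use `set` (cmd) or `$env:` (PowerShell)
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python train.py
```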
Questions:
Is it normal for an RTX 4060 with 8GB of VRAM to run out of memory with this setup?
Are there any other strategies I can use to manage GPU memory effectively?
Is my gradient-accumulation implementation correct?
Any insights or suggestions for resolving this would be greatly appreciated. Thanks!
ResNet-18 is a relatively small network to be causing GPU out-of-memory problems. Could you share more details about the data you are using?