CUDA out-of-memory error on RTX 4060 when training ResNet-18 on 128x128 images

0 votes · 1 answer

I am hitting a CUDA out-of-memory error while training a ResNet-18 model on 128x128 images with an RTX 4060 GPU. The problem persists even after reducing the batch size. Here are the details of my setup and code:

Setup:

  • GPU: NVIDIA RTX 4060 with 8GB VRAM

  • Framework: PyTorch

  • Model: ResNet-18 (pretrained)

  • Image size: 128x128

  • Batch size: 4 (initially tried 8, 16, 32)

  • Accumulation steps: 8

  • Optimizer: Adam

  • Loss function: CrossEntropyLoss

  • OS: Windows 11
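For context, a back-of-the-envelope estimate (a sketch, assuming the stock torchvision ResNet-18 parameter count of roughly 11.7M) suggests the model state alone is nowhere near 8 GB:

```python
# Rough fp32 training-memory estimate for ResNet-18.
# Assumption: ~11.7M parameters (stock torchvision resnet18, 1000-class head).
params = 11_689_512
bytes_per_value = 4   # fp32
# weights + gradients + Adam's exp_avg + exp_avg_sq buffers ≈ 4 copies
copies = 4
model_mb = params * bytes_per_value * copies / 2**20
print(f"~{model_mb:.0f} MB for model state")  # ~178 MB, far below 8 GB
```

Activations dominate instead, but at batch size 4 and 128x128 inputs they should also be small, which is what makes the OOM surprising.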

import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import GradScaler, autocast

# Define transformations
transform = transforms.Compose([
    transforms.Resize((128, 128)),  # Reduce image size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load dataset
data_dir = 'OSA_16_07/OSA'
dataset = torchvision.datasets.ImageFolder(root=data_dir, transform=transform)

# Split the dataset into training and validation sets
train_size = int(0.8 * len(dataset))  # 80% for training
val_size = len(dataset) - train_size  # 20% for validation
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# Create data loaders with smaller batch size
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)  # Reduced batch size
val_loader = DataLoader(val_dataset, batch_size=4, shuffle=False)  # Reduced batch size

# Load pre-trained ResNet-18 model
model = torchvision.models.resnet18(pretrained=True)

# Replace the final fully connected layer
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 11)

# Move the model to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Initialize GradScaler for mixed precision training
scaler = GradScaler()

# Training loop with gradient accumulation
num_epochs = 10
accumulation_steps = 8  # Number of steps to accumulate gradients
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    optimizer.zero_grad()
    
    for i, (inputs, labels) in enumerate(train_loader):
        inputs, labels = inputs.to(device), labels.to(device)
        
        with autocast():
            outputs = model(inputs)
            loss = criterion(outputs, labels) / accumulation_steps
        
        scaler.scale(loss).backward()
        
        # Step on every full accumulation window, and also flush any partial
        # window at the end of the epoch so those gradients are not discarded
        if (i + 1) % accumulation_steps == 0 or (i + 1) == len(train_loader):
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
        
        running_loss += loss.item() * accumulation_steps
    
    print(f"Epoch {epoch+1}/{num_epochs}, Training Loss: {running_loss/len(train_loader)}")
    
    # Validation step
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            
            with autocast():
                outputs = model(inputs)
                loss = criterion(outputs, labels)
            
            val_loss += loss.item()
            
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    val_accuracy = 100 * correct / total
    print(f"Validation Loss: {val_loss/len(val_loader)}, Validation Accuracy: {val_accuracy}%")
    
    # Clear GPU cache
    torch.cuda.empty_cache()

Steps taken:

  1. Reduced the batch size to 4.

  2. Called torch.cuda.empty_cache() after every epoch to clear the GPU cache.

  3. Verified with nvidia-smi that no other processes were consuming GPU memory.
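Beyond nvidia-smi, PyTorch's own allocator counters can show where the memory goes from inside the script. A minimal sketch (the helper name `log_gpu_memory` is my own; it can be called after each epoch in the loop above):

```python
import torch

def log_gpu_memory(tag=""):
    """Print PyTorch's view of GPU memory (complements nvidia-smi)."""
    if not torch.cuda.is_available():
        return
    alloc = torch.cuda.memory_allocated() / 2**20      # tensors currently live
    reserved = torch.cuda.memory_reserved() / 2**20    # held by the caching allocator
    peak = torch.cuda.max_memory_allocated() / 2**20   # high-water mark
    print(f"{tag} allocated={alloc:.0f}MB reserved={reserved:.0f}MB peak={peak:.0f}MB")

# e.g. log_gpu_memory(f"epoch {epoch}") after each epoch;
# torch.cuda.reset_peak_memory_stats() starts a fresh peak measurement.
```

If `allocated` grows epoch over epoch, something is keeping references to GPU tensors; if only `reserved` is large, it is cached memory that empty_cache() releases.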

Questions:

  1. Is it normal for an RTX 4060 with 8GB VRAM to run out of memory with this setup?

  2. Are there other strategies I can use to manage GPU memory effectively?

  3. Is my gradient accumulation implemented correctly?

Any insights or suggestions to resolve this issue would be greatly appreciated. Thanks!

pytorch gpu resnet torchvision
1 Answer

0 votes

ResNet-18 is a relatively small network to be causing GPU out-of-memory issues. Could you share more details about the data you are using?
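One quick check while waiting on those details: confirm the per-batch tensor is actually as small as the transform implies. A sketch (the zeros tensor stands in for one training batch of the shape the question describes):

```python
import torch

# A (4, 3, 128, 128) fp32 batch, as produced by the Resize + ToTensor
# pipeline at batch size 4, is under 1 MB, so the inputs themselves
# cannot explain an 8 GB OOM.
batch = torch.zeros(4, 3, 128, 128)  # stand-in for one training batch
batch_mb = batch.numel() * batch.element_size() / 2**20
print(f"batch shape: {tuple(batch.shape)}, {batch_mb:.2f} MB")  # 0.75 MB
```

If a real batch from the DataLoader is much larger than this, the transform is not being applied as expected, which would point at the data rather than the model.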
