Batch size mismatch in multi-label classification: input [32, 3, 224, 224], output [98, 129]


I'm fine-tuning ResNet18 from Hugging Face on a multi-label dataset. I want to predict 3 labels for each image, so I created 3 fully connected heads. First, I tried to update the classifier layer of ResNet18:

num_classes_artist = 129
num_classes_style = 27
num_classes_genre = 11

model2.classifier_artist = torch.nn.Sequential(
    torch.nn.Dropout(p=0.2, inplace=True),
    torch.nn.Linear(in_features=512, out_features=num_classes_artist, bias=True)
).to(device)

model2.classifier_style = torch.nn.Sequential(
    torch.nn.Dropout(p=0.2, inplace=True),
    torch.nn.Linear(in_features=512, out_features=num_classes_style, bias=True)
).to(device)

model2.classifier_genre = torch.nn.Sequential(
    torch.nn.Dropout(p=0.2, inplace=True),
    torch.nn.Linear(in_features=512, out_features=num_classes_genre, bias=True)
).to(device)

But this didn't work: the model architecture doesn't include the 3 classifiers I added. Here is the torchinfo summary:

Layer (type (var_name))                                                     Input Shape          Output Shape
================================================================================
ResNetForImageClassification (ResNetForImageClassification)                 [32, 3, 224, 224]    [32, 1000]
├─ResNetModel (resnet)                                                      [32, 3, 224, 224]    [32, 512, 1, 1]
│    └─ResNetEmbeddings (embedder)                                          [32, 3, 224, 224]    [32, 64, 56, 56]
│    │    └─ResNetConvLayer (embedder)                                      [32, 3, 224, 224]    [32, 64, 112, 112]
│    │    └─MaxPool2d (pooler)                                              [32, 64, 112, 112]   [32, 64, 56, 56]
│    └─ResNetEncoder (encoder)                                              [32, 64, 56, 56]     [32, 512, 7, 7]
│    │    └─ModuleList (stages)                                             --                   --
│    └─AdaptiveAvgPool2d (pooler)                                           [32, 512, 7, 7]      [32, 512, 1, 1]
├─Sequential (classifier)                                                   [32, 512, 1, 1]      [32, 1000]
│    └─Flatten (0)                                                          [32, 512, 1, 1]      [32, 512]
│    └─Linear (1)                                                           [32, 512]            [32, 1000]
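
I suspect this is because torchinfo only reports modules that actually run in forward(), and attributes assigned after construction are never called there. A toy check of that suspicion (the Toy class is purely illustrative, not part of my code):

import torch

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.used(x)  # only self.used participates in the forward pass

m = Toy()
m.extra = torch.nn.Linear(4, 3)    # registered as a submodule, but never called
print(m(torch.randn(1, 4)).shape)  # torch.Size([1, 2]) -- extra is invisible to a forward trace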

After that, I started implementing the model from scratch in PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class WikiartModel(nn.Module):
    def __init__(self, num_artists, num_genres, num_styles):
        super(WikiartModel, self).__init__()

        # Shared convolutional layers
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        
        # Artist classification branch
        self.fc_artist1 = nn.Linear(256 * 16 * 16, 512)
        self.fc_artist2 = nn.Linear(512, num_artists)

        # Genre classification branch
        self.fc_genre1 = nn.Linear(256 * 16 * 16, 512)
        self.fc_genre2 = nn.Linear(512, num_genres)

        # Style classification branch
        self.fc_style1 = nn.Linear(256 * 16 * 16, 512)
        self.fc_style2 = nn.Linear(512, num_styles)
        
    def forward(self, x):
        # Shared convolutional layers
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(-1, 256 * 16 * 16)

        # Artist classification branch
        artists_out = F.relu(self.fc_artist1(x))
        artists_out = self.fc_artist2(artists_out)

        # Genre classification branch
        genre_out = F.relu(self.fc_genre1(x))
        genre_out = self.fc_genre2(genre_out)

        # Style classification branch
        style_out = F.relu(self.fc_style1(x))
        style_out = self.fc_style2(style_out)

        return artists_out, genre_out, style_out
    
# Set the number of classes for each task
num_artists = 129  # Including "Unknown Artist"
num_genres = 11    # Including "Unknown Genre"
num_styles = 27
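
For reference, I generate the summary roughly like this (the exact torchinfo call below is reconstructed from the output columns, so treat it as an approximation):

from torchinfo import summary

# reconstructed call -- adjust to your actual invocation
model = WikiartModel(num_artists, num_genres, num_styles)
summary(model, input_size=(32, 3, 224, 224),
        col_names=("input_size", "output_size"))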

Here is the resulting torchinfo summary:

Layer (type (var_name))                  Input Shape          Output Shape
================================================================================
WikiartModel (WikiartModel)              [32, 3, 224, 224]    [98, 129]
├─Conv2d (conv1)                         [32, 3, 224, 224]    [32, 64, 224, 224]
├─MaxPool2d (pool)                       [32, 64, 224, 224]   [32, 64, 112, 112]
├─Conv2d (conv2)                         [32, 64, 112, 112]   [32, 128, 112, 112]
├─MaxPool2d (pool)                       [32, 128, 112, 112]  [32, 128, 56, 56]
├─Conv2d (conv3)                         [32, 128, 56, 56]    [32, 256, 56, 56]
├─MaxPool2d (pool)                       [32, 256, 56, 56]    [32, 256, 28, 28]
├─Linear (fc_artist1)                    [98, 65536]          [98, 512]
├─Linear (fc_artist2)                    [98, 512]            [98, 129]
├─Linear (fc_genre1)                     [98, 65536]          [98, 512]
├─Linear (fc_genre2)                     [98, 512]            [98, 11]
├─Linear (fc_style1)                     [98, 65536]          [98, 512]
├─Linear (fc_style2)                     [98, 512]            [98, 27]

The batch size of the input data ([32, 3, 224, 224]) and the batch size of the model's output predictions ([98, 129]) don't match. I've checked my data loading, model architecture, and training loop, but I can't pin down the root cause. The mismatch triggers an error when the loss is computed inside the training loop:

loss_artist = criterion_artist(outputs_artist, labels_artist)

ValueError: Expected input batch_size (98) to match target batch_size (32).

python deep-learning pytorch huggingface image-classification
1 Answer

In your model architecture you define three 2D convolutional layers (self.conv1, self.conv2, self.conv3) and one max-pooling layer (self.pool):

self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
self.conv3 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
self.pool = nn.MaxPool2d(2, 2)

Each convolution preserves the spatial resolution (kernel_size=3 with padding=1), while each pass through MaxPool2d(2, 2) halves it. So for an input tensor of size [32, 3, 224, 224], after the image passes through

x = self.pool(F.relu(self.conv1(x)))
x = self.pool(F.relu(self.conv2(x)))
x = self.pool(F.relu(self.conv3(x)))

the size of x becomes [32, 256, 28, 28] (224 → 112 → 56 → 28).

Now you want to feed x into the fully connected branches. The line

x = x.view(-1, 256 * 16  * 16)

is exactly where the batch size of 98 comes from: it packs the 32 * 256 * 28 * 28 = 6,422,528 elements of x into rows of 256 * 16 * 16 = 65,536, which yields 6,422,528 / 65,536 = 98 rows, silently changing the batch dimension. Replace that line with

x = x.view(x.size(0), -1)

which flattens the tensor to shape [32, 200704] (256 * 28 * 28 = 200704) while preserving the batch dimension. Accordingly, you should adjust

nn.Linear(256 * 16 * 16, 512)

to

nn.Linear(256 * 28 * 28, 512)
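
You can verify the reshape behaviour in isolation (a standalone check, independent of the model):

import torch

x = torch.randn(32, 256, 28, 28)        # shape after the shared conv stack
print(x.view(-1, 256 * 16 * 16).shape)  # torch.Size([98, 65536])  -- batch dim silently becomes 98
print(x.view(x.size(0), -1).shape)      # torch.Size([32, 200704]) -- batch dim preserved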

The modified WikiartModel class looks like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class WikiartModel(nn.Module):
    def __init__(self, num_artists, num_genres, num_styles):
        super(WikiartModel, self).__init__()

        # Shared convolutional layers
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.size = 28  # spatial size after three 2x2 poolings: 224 -> 112 -> 56 -> 28

        # Artist classification branch
        self.fc_artist1 = nn.Linear(256 * self.size * self.size, 512)
        self.fc_artist2 = nn.Linear(512, num_artists)

        # Genre classification branch
        self.fc_genre1 = nn.Linear(256 * self.size * self.size, 512)
        self.fc_genre2 = nn.Linear(512, num_genres)

        # Style classification branch
        self.fc_style1 = nn.Linear(256 * self.size * self.size, 512)
        self.fc_style2 = nn.Linear(512, num_styles)

    def forward(self, x):
        # Shared convolutional layers
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(x.size(0), -1)  # flatten while preserving the batch dimension

        # Artist classification branch
        artists_out = F.relu(self.fc_artist1(x))
        artists_out = self.fc_artist2(artists_out)

        # Genre classification branch
        genre_out = F.relu(self.fc_genre1(x))
        genre_out = self.fc_genre2(genre_out)

        # Style classification branch
        style_out = F.relu(self.fc_style1(x))
        style_out = self.fc_style2(style_out)

        return artists_out, genre_out, style_out

# Set the number of classes for each task
num_artists = 129  # Including "Unknown Artist"
num_genres = 11    # Including "Unknown Genre"
num_styles = 27
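
As an aside, if you'd rather stay with the Hugging Face ResNet18 from your first attempt: assigning new attributes such as model2.classifier_artist registers extra modules but never wires them into forward(), which is why they don't appear in the summary. Below is a minimal sketch of one way to attach three heads to the backbone, assuming the transformers ResNetModel and the microsoft/resnet-18 checkpoint (the checkpoint name is an assumption about your setup):

import torch
from transformers import ResNetModel

class MultiHeadResNet(torch.nn.Module):
    def __init__(self, num_artists, num_genres, num_styles):
        super().__init__()
        # checkpoint name assumed; substitute whatever you fine-tune
        self.backbone = ResNetModel.from_pretrained("microsoft/resnet-18")

        def make_head(num_classes):
            return torch.nn.Sequential(
                torch.nn.Flatten(),            # [B, 512, 1, 1] -> [B, 512]
                torch.nn.Dropout(p=0.2),
                torch.nn.Linear(512, num_classes),
            )

        self.classifier_artist = make_head(num_artists)
        self.classifier_genre = make_head(num_genres)
        self.classifier_style = make_head(num_styles)

    def forward(self, pixel_values):
        # pooler_output has shape [B, 512, 1, 1], as in the first summary above
        pooled = self.backbone(pixel_values).pooler_output
        return (self.classifier_artist(pooled),
                self.classifier_genre(pooled),
                self.classifier_style(pooled))

Either way, the three outputs can then be trained with one loss per task, e.g. summing three CrossEntropyLoss terms before calling backward().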