How do I implement this attention layer in PyTorch?

Question

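I've already implemented the CNN part and everything seems to be working fine. After that I started implementing the LSTM part, and if I understand correctly the output shape should be `(batch_size, 256)` (because it is bidirectional, 1 layer, 128 units). I assume that part is fine.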

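What I'm trying to figure out is how to implement the attention layer. As I understand it, it is basically a weight tensor that is multiplied by the LSTM output, after which a softmax is applied and the result is fed into the final linear layer. My questions are: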

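1. Did I understand this correctly? Is it really that simple?
2. What is the size of the weight tensor: `(128)`, `(256)`, `(2, 128)`, or something else?
3. How do I do the tensor multiplication correctly? In my first attempt I created the weight tensor as an `nn.Linear` with equal `in_features` and `out_features` (256), and then applied `torch.mul` to the input (the LSTM output) and the weights. Is that right?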

Here is a snippet of my attempt at the attention layer:

class Attention_Layer(nn.Module):
    def __init__(self, n_feats: int) -> None:
        super().__init__()
        self.w = nn.Linear(
            in_features=n_feats,
            out_features=n_feats
        )
    
    def forward(self, X: torch.Tensor) -> torch.Tensor:
        w = self.w(X)
        output = F.softmax(torch.mul(X, w), dim=1)
        return output

And a snippet of the full model architecture:

class Model(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.in_channels = 5
        self.linear_input_features = 1103872
        
        self.cnn = nn.Sequential(
            FLB(
                input_channels=self.in_channels,
                output_channels=64,
                kernel_size=(2, 2)
            ),
            nn.MaxPool2d(kernel_size=(2, 2)),
            FLB(
                input_channels=64,
                output_channels=128,
                kernel_size=(2, 2)
            ),
            FLB(
                input_channels=128,
                output_channels=256,
                kernel_size=(2, 2)
            ),
            nn.MaxPool2d(kernel_size=(2, 2)),
            FLB(
                input_channels=256,
                output_channels=512,
                kernel_size=(2, 2)
            ),
            FLB(
                input_channels=512,
                output_channels=512,
                kernel_size=(2, 2)
            ),
            nn.Flatten(),
            nn.Linear(
                in_features=self.linear_input_features,
                out_features=128
            )
        )
        
        self.lstm = nn.Sequential(
            nn.LSTM(
                input_size=128,
                hidden_size=128,
                num_layers=1,
                batch_first=True,
                bidirectional=True
            ),
            Extract_LSTM_Output()
        )
        
        self.model = nn.Sequential(
            self.cnn,
            self.lstm,
            Attention_Layer(256),
            nn.Linear(
                in_features=256,
                out_features=3
            )
        )
        self.model.apply(weight_init)
    
    def forward(self, X: torch.Tensor) -> torch.Tensor:
        return self.model(X)

Answer 1

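Your understanding of the attention mechanism is on the right track. In the context of an LSTM model, the attention layer assigns weights to the LSTM outputs before they are fed into the final linear layer. These weights come from learned parameters and help the model focus on the more relevant parts of the input sequence.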
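As a rough illustration of the weighting described above, here is a minimal sketch of attention pooling over the per-step LSTM outputs. The names `AttentionPooling` and `score`, the batch size of 4, and the sequence length of 10 are assumptions made up for this example, not part of your model, and the sketch assumes the full LSTM output sequence is available rather than only the last time step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPooling(nn.Module):
    """Hypothetical pooling layer: one learned score per time step."""

    def __init__(self, n_feats: int) -> None:
        super().__init__()
        # Maps each time step's feature vector to a single scalar score.
        self.score = nn.Linear(n_feats, 1)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch_size, seq_len, n_feats), the per-step LSTM outputs
        scores = self.score(H)                # (batch_size, seq_len, 1)
        weights = F.softmax(scores, dim=1)    # normalize over time steps
        context = (weights * H).sum(dim=1)    # (batch_size, n_feats)
        return context


torch.manual_seed(0)
lstm = nn.LSTM(input_size=128, hidden_size=128, num_layers=1,
               batch_first=True, bidirectional=True)
pool = AttentionPooling(256)

x = torch.randn(4, 10, 128)   # (batch_size, seq_len, features), made-up sizes
H, _ = lstm(x)                # H: (4, 10, 256), forward and backward states concatenated
context = pool(H)             # (4, 256), ready for a final nn.Linear(256, 3)
print(H.shape, context.shape)
```

In this variant the learned weight of `score` has shape `(1, 256)` plus a bias, and the softmax runs over the time dimension, so each sample gets one weight per time step. Other scoring functions are possible; this is only one common choice.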

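As for your implementation of the attention layer, a few things may need adjusting. Attention mechanisms usually follow a query-key-value framework, even in self-attention, where all three come from the same source. Here is a revised attention layer in PyTorch, tailored for self-attention: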

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionLayer(nn.Module):
    def __init__(self, feature_size):
        super(SelfAttentionLayer, self).__init__()
        self.feature_size = feature_size

        # Linear transformations for Q, K, V from the same source
        self.key = nn.Linear(feature_size, feature_size)
        self.query = nn.Linear(feature_size, feature_size)
        self.value = nn.Linear(feature_size, feature_size)

    def forward(self, x, mask=None):
        # Apply linear transformations
        keys = self.key(x)
        queries = self.query(x)
        values = self.value(x)

        # Scaled dot-product attention
        scores = torch.matmul(queries, keys.transpose(-2, -1)) / torch.sqrt(torch.tensor(self.feature_size, dtype=torch.float32))

        # Apply mask (if provided)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        # Apply softmax
        attention_weights = F.softmax(scores, dim=-1)

        # Multiply weights with values
        output = torch.matmul(attention_weights, values)

        return output, attention_weights

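This implementation is a more standard approach to self-attention, which may improve the model's ability to focus on the relevant features in the LSTM output. `feature_size` should be set to the LSTM's output feature size; in your case, with a bidirectional LSTM of 128 hidden units, that is 256. Note that the layer attends over the sequence dimension, so it expects the per-step LSTM outputs with shape `(batch_size, seq_len, 256)`. Given a 2-D `(batch_size, 256)` tensor, the matrix product would instead compute scores between different samples in the batch.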
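To show how the layer above might be wired in, here is a hedged usage sketch. The class is restated in compact form so the snippet runs on its own (using `math.sqrt` for the scaling factor), and the random `lstm_out` tensor, the batch size of 4, the sequence length of 10, and the mean pooling over time are all assumptions for illustration. Because `forward` returns a tuple, the layer cannot be dropped straight into an `nn.Sequential` ahead of `nn.Linear`, so the output is unpacked by hand here.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentionLayer(nn.Module):
    # Compact restatement of the layer above so this snippet runs on its own.
    def __init__(self, feature_size):
        super().__init__()
        self.feature_size = feature_size
        self.key = nn.Linear(feature_size, feature_size)
        self.query = nn.Linear(feature_size, feature_size)
        self.value = nn.Linear(feature_size, feature_size)

    def forward(self, x, mask=None):
        keys, queries, values = self.key(x), self.query(x), self.value(x)
        # Scaled dot-product scores between every pair of time steps
        scores = torch.matmul(queries, keys.transpose(-2, -1)) / math.sqrt(self.feature_size)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention_weights = F.softmax(scores, dim=-1)
        return torch.matmul(attention_weights, values), attention_weights


torch.manual_seed(0)
attention = SelfAttentionLayer(256)
classifier = nn.Linear(256, 3)

lstm_out = torch.randn(4, 10, 256)        # stand-in for (batch_size, seq_len, 256)
attended, weights = attention(lstm_out)   # (4, 10, 256) and (4, 10, 10)
pooled = attended.mean(dim=1)             # (4, 256); mean pooling is an assumption
logits = classifier(pooled)               # (4, 3)
print(attended.shape, weights.shape, logits.shape)
```

Mean pooling is only one way to collapse the sequence dimension before the classifier; taking the last time step or a learned weighted sum would also work.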
