How to calculate the weighted sum of the last 4 hidden layers using Roberta?

Question

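The table from this paper explains various approaches to obtain the embedding, and I think these approaches are also applicable to Roberta:

I'm trying to calculate the weighted sum of the last 4 hidden layers using Roberta to obtain token embeddings, but I don't know if this is the correct way to do it. This is the code I have tried: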

from transformers import RobertaTokenizer, RobertaModel
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
caption = ['this is a yellow bird', 'example caption']

tokens = tokenizer(caption, return_tensors='pt', padding=True)

input_ids = tokens['input_ids']
attention_mask = tokens['attention_mask']

output = model(input_ids, attention_mask, output_hidden_states=True)

states = output.hidden_states
token_emb = torch.stack([states[i] for i in [-4, -3, -2, -1]]).sum(0).squeeze()

Answer 1

First, let's do some digging into the OG BERT code, https://github.com/google-research/bert

If we do a quick search for "sum" in the GitHub repository, we find this: https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/modeling.py#L814

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

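Then a quick search on Stackoverflow turns up "How to get intermediate layers' output of pre-trained BERT model in HuggingFace Transformers library?"

Now, let's validate whether your code logic works by working backwards a little: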

from transformers import RobertaTokenizer, RobertaModel
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
caption = ['this is a yellow bird', 'example caption']

tokens = tokenizer(caption, return_tensors='pt', padding=True)

input_ids = tokens['input_ids']
attention_mask = tokens['attention_mask']

output = model(input_ids, attention_mask, output_hidden_states=True)

Then:

>>> len(output.hidden_states)

[out]:

13

Why 13?

1 embedding layer output + 12 encoder (hidden) layer outputs
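If you want to confirm the "1 + 12" count without relying on the printed summary, here is a minimal sketch. It assumes a tiny, randomly initialised RobertaModel built from a hypothetical RobertaConfig (so nothing is downloaded and the embeddings are meaningless); only num_hidden_layers=12 mirrors roberta-base. The names tiny_model, fake_input_ids and tiny_output are made up for this illustration.

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Hypothetical tiny configuration: only num_hidden_layers mirrors roberta-base,
# every other size is shrunk so the model builds quickly.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=12,
    num_attention_heads=4,
    intermediate_size=64,
    max_position_embeddings=64,
)
tiny_model = RobertaModel(config)  # random weights, no pretrained checkpoint
tiny_model.eval()

# Fake batch: 2 sequences of 7 token ids (ids start at 3 to avoid special tokens).
fake_input_ids = torch.randint(3, 100, (2, 7))

with torch.no_grad():
    tiny_output = tiny_model(fake_input_ids, output_hidden_states=True)

# One entry for the embedding output plus one per encoder layer.
num_states = len(tiny_output.hidden_states)
print(num_states, config.num_hidden_layers + 1)
```

The same relation should hold for the real model, i.e. len(output.hidden_states) == model.config.num_hidden_layers + 1.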

import torchinfo
torchinfo.summary(model)

[out]:

================================================================================
Layer (type:depth-idx)                                  Param #
================================================================================
RobertaModel                                            --
├─RobertaEmbeddings: 1-1                                --
│    └─Embedding: 2-1                                   38,603,520
│    └─Embedding: 2-2                                   394,752
│    └─Embedding: 2-3                                   768
│    └─LayerNorm: 2-4                                   1,536
│    └─Dropout: 2-5                                     --
├─RobertaEncoder: 1-2                                   --
│    └─ModuleList: 2-6                                  --
│    │    └─RobertaLayer: 3-1                           7,087,872
│    │    └─RobertaLayer: 3-2                           7,087,872
│    │    └─RobertaLayer: 3-3                           7,087,872
│    │    └─RobertaLayer: 3-4                           7,087,872
│    │    └─RobertaLayer: 3-5                           7,087,872
│    │    └─RobertaLayer: 3-6                           7,087,872
│    │    └─RobertaLayer: 3-7                           7,087,872
│    │    └─RobertaLayer: 3-8                           7,087,872
│    │    └─RobertaLayer: 3-9                           7,087,872
│    │    └─RobertaLayer: 3-10                          7,087,872
│    │    └─RobertaLayer: 3-11                          7,087,872
│    │    └─RobertaLayer: 3-12                          7,087,872
├─RobertaPooler: 1-3                                    --
│    └─Linear: 2-7                                      590,592
│    └─Tanh: 2-8                                        --
================================================================================
Total params: 124,645,632
Trainable params: 124,645,632
Non-trainable params: 0
================================================================================

Let's check the size of each layer's output:

first_hidden_shape = output.hidden_states[0].shape

for x in output.hidden_states:
  assert x.shape == first_hidden_shape

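Checks out!

>>> first_hidden_shape

[out]:

torch.Size([2, 7, 768])

Why [2, 7, 768]?

It's (batch_size, sequence_length, hidden_size):

2 sentences = batch size of 2
7 longest sequence length = no. of tokens (i.e. 5 from the longest example sentence + <s> and </s>), len(input_ids[0])
768 outputs = fixed for all hidden layers' output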

Bíddu aðeins! (Wait a minute!), does that mean I get sequence_length * 768 outputs for every batch? And if my batches are of unequal lengths, the output sizes will be different?

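Yes, that is correct! To get an "equal" footing for all inputs, it's best to pad/truncate all outputs to a fixed length if you are still going to use the feature-based BERT approach.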
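One way to do the pad/truncate step on the outputs is sketched below. It is only an illustration under assumptions: pad_or_truncate and fixed_len are hypothetical names, random tensors stand in for real token embeddings, and zero is an arbitrary choice of padding value.

```python
import torch
import torch.nn.functional as F


def pad_or_truncate(token_emb: torch.Tensor, fixed_len: int) -> torch.Tensor:
    """Force (batch_size, sequence_length, hidden_size) to sequence_length == fixed_len."""
    seq_len = token_emb.shape[1]
    if seq_len >= fixed_len:
        # Too long (or already the right size): keep the first fixed_len positions.
        return token_emb[:, :fixed_len, :]
    # F.pad takes (left, right) pairs starting from the LAST dimension:
    # (hidden_left, hidden_right, seq_left, seq_right).
    return F.pad(token_emb, (0, 0, 0, fixed_len - seq_len))


# Stand-ins for summed hidden states from two batches of different lengths.
short_batch = torch.randn(2, 7, 768)
long_batch = torch.randn(2, 15, 768)

padded = pad_or_truncate(short_batch, 10)
truncated = pad_or_truncate(long_batch, 10)
print(padded.shape, truncated.shape)
```

Alternatively, you can fix the length on the input side by calling the tokenizer with padding='max_length', truncation=True and a max_length of your choosing, so every batch already comes out of the model with the same sequence_length.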

Soooo, is my torch.stack approach right?

Yes, it looks like it:

torch.stack(output.hidden_states[-4:]).sum(0)

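A nitpick: you can access output.hidden_states by slicing, since it's a tuple object. Also, you don't need to squeeze the stacked output: squeeze() only removes dimensions of size 1, and none of the dimensions in [2, 7, 768] has size 1, so it does nothing here.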
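To see why the squeeze is better left out, here is a small sketch with random tensors standing in for output.hidden_states[-4:] (batch_of_two and batch_of_one are made-up names). With a single caption in the batch, squeeze() would silently drop the batch dimension.

```python
import torch

# Stand-ins for output.hidden_states[-4:] with a batch of 2 and a batch of 1.
batch_of_two = [torch.randn(2, 7, 768) for _ in range(4)]
batch_of_one = [torch.randn(1, 7, 768) for _ in range(4)]

summed_two = torch.stack(batch_of_two).sum(0)
summed_one = torch.stack(batch_of_one).sum(0)

# No dimension has size 1, so squeeze() leaves the shape alone.
print(summed_two.squeeze().shape)
# The batch dimension has size 1, so squeeze() removes it.
print(summed_one.squeeze().shape)
```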

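This is a special case of stacking. In NLP the first dimension is usually the batch size and the second is the token length, and torch.stack adds a new axis that holds the four layers. As long as you sum over the same axis you stacked on, you end up with the same element-wise sum of the layers, even when you don't explicitly state which dimension to stack on.

To be more explicit: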

# Stack the four layers on a new axis at position 2 -> (2, 7, 4, 768),
# then sum that same axis away again -> (2, 7, 768).
torch.stack(output.hidden_states[-4:], dim=2).sum(2)

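But really, you can do this to reassure yourself:

# Stacking and summing over dim 0 or over dim 2 gives the same token embeddings.
assert torch.allclose(
    torch.stack(output.hidden_states[-4:]).sum(0),
    torch.stack(output.hidden_states[-4:], dim=2).sum(2),
    atol=1e-6,
)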
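Since the title asks about a weighted sum while the snippets above compute a plain sum, here is a hedged sketch of adding per-layer weights. The values in layer_weights are hypothetical and not taken from the paper or the question, and random tensors stand in for output.hidden_states[-4:].

```python
import torch

# Stand-in for output.hidden_states[-4:].
last_four = tuple(torch.randn(2, 7, 768) for _ in range(4))

# Hypothetical weights, one per layer (assumed values for illustration only).
layer_weights = torch.tensor([0.1, 0.2, 0.3, 0.4])

stacked = torch.stack(last_four)  # (4, 2, 7, 768)

# Reshape the weights to (4, 1, 1, 1) so each one broadcasts over its layer,
# then sum the layer axis away -> (2, 7, 768).
weighted = (layer_weights.view(4, 1, 1, 1) * stacked).sum(0)

# With every weight equal to 1 this reduces to the plain sum used above.
plain = (torch.ones(4).view(4, 1, 1, 1) * stacked).sum(0)
print(weighted.shape)
```

If you would rather learn the weights than fix them by hand, one option is to make layer_weights a torch.nn.Parameter and train it with the rest of your model.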

Interesting, what about "concat last four hidden"?

In the case of concatenation, you need to be explicit about the dimension:

>>> torch.cat(output.hidden_states[-4:], dim=2).shape

[out]:

torch.Size([2, 7, 3072])

But note, you still end up with sequence_length * hidden_size * 4 outputs, which makes batches of unequal lengths painful.
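To make it concrete what the concatenated tensor holds, here is a short sketch that again uses random tensors as stand-ins for output.hidden_states[-4:]. Each 768-wide slice of the last dimension is one of the original layers, in the same order.

```python
import torch

# Stand-in for output.hidden_states[-4:].
last_four = tuple(torch.randn(2, 7, 768) for _ in range(4))

# Concatenate along the hidden dimension: 4 * 768 = 3072 features per token.
concatenated = torch.cat(last_four, dim=-1)

# Splitting the last dimension back into 768-wide chunks recovers each layer.
recovered = torch.split(concatenated, 768, dim=-1)
print(concatenated.shape, len(recovered))
```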

Why didn't you just answer "Yes, you are right"?

If I had, would that have convinced you more than proving the above for yourself?
