How to use BERT to predict the probability of an empty string

Question

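Suppose we have a template sentence like this:

"The ____ house is our meeting place."

We have a list of adjectives to fill the blank, for example:

"yellow"
"large"
""

Note that one of them is the empty string.

The goal is to compare the probabilities of the candidates and pick the word most likely to describe "house" given the sentence context. If "nothing at all" is more likely, that should be taken into account too. We can predict the probability of each word filling the blank, but how do we predict the probability that no adjective describes "house"?

Predicting the probability of a word: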

    from transformers import BertTokenizer, BertForMaskedLM
    import torch
    from torch.nn import functional as F

    # Load BERT tokenizer and pre-trained model
    tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
    model = BertForMaskedLM.from_pretrained('bert-large-uncased', return_dict=True)

    targets = ["yellow", "large"]
    sentence = "The [MASK] house is our meeting place."

    # Using BERT, compute probability over its entire vocabulary, returning logits
    input = tokenizer.encode_plus(sentence, return_tensors = "pt")
    mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)[0]
    with torch.no_grad():
        output = model(**input)

    # Run softmax over the logits to get the probabilities
    softmax = F.softmax(output.logits[0], dim=-1)

    # Find the words' probabilities in this probability distribution
    target_probabilities = {t: softmax[mask_index, tokenizer.vocab[t]].numpy()[0] for t in targets}
    target_probabilities

This outputs the words and their associated probabilities:

{'yellow': 0.0061520976, 'large': 0.00071377633}
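To make the lookup step concrete without downloading a model, here is a minimal sketch of the same softmax-then-index pattern on a toy example. The four-entry vocabulary and the logit values are made up for illustration and are not BERT outputs.

```python
import math

# Hypothetical toy vocabulary and logits (not real BERT values).
toy_vocab = {"yellow": 0, "large": 1, "old": 2, "[PAD]": 3}
toy_logits = [2.0, 1.0, 3.0, -4.0]  # one score per vocabulary entry


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(toy_logits)

# Read off the probability of each target by its vocabulary index.
targets = ["yellow", "large"]
target_probabilities = {t: probs[toy_vocab[t]] for t in targets}
print(target_probabilities)
```

The probabilities are small in the real run because the mass is spread over the whole vocabulary, not just the targets of interest.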

If I try to add the empty string to the list, I get the following error:

    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-62-6f726220a108> in <module>
         18
         19 # Find the words' probabilities in this probability distribution
    ---> 20 target_probabilities = {t: softmax[mask_index, tokenizer.vocab[t]].numpy()[0] for t in targets}
         21 target_probabilities

    <ipython-input-62-6f726220a108> in <dictcomp>(.0)
         18
         19 # Find the words' probabilities in this probability distribution
    ---> 20 target_probabilities = {t: softmax[mask_index, tokenizer.vocab[t]].numpy()[0] for t in targets}
         21 target_probabilities

    KeyError: ''

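This is because BERT's vocabulary does not contain the empty string, so we cannot look up the probability of something that does not exist in the model.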
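The failure can be reproduced, and guarded against, with a plain dictionary. The sketch below uses a made-up three-entry vocabulary and a hypothetical helper named `split_known_targets`; with the real tokenizer one would presumably pass `tokenizer.vocab` instead. It only separates targets that can be looked up from those that cannot, and does not by itself say how to score the empty string.

```python
def split_known_targets(vocab, targets):
    """Hypothetical helper: split targets into vocabulary hits and misses."""
    known = {t: vocab[t] for t in targets if t in vocab}
    missing = [t for t in targets if t not in vocab]
    return known, missing


# Made-up stand-in for tokenizer.vocab (token -> id).
toy_vocab = {"yellow": 0, "large": 1, "[PAD]": 2}

known, missing = split_known_targets(toy_vocab, ["yellow", "large", ""])
print(known)    # {'yellow': 0, 'large': 1}
print(missing)  # ['']  (a direct vocab[''] lookup would raise KeyError)
```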

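How should we get the probability that no word fills the blank? Can the model do this at all? Does it make sense to use the padding token [PAD] instead of the empty string? (I have only ever seen [PAD] used at the end of sentences, to make a batch of sentences the same length.)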

Answer 1

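First, I should say that when you apply softmax to BERT's logit scores, the results are not true probabilities. It does seem to be an acceptable approach, though, so I will use it as well.

Second, you should also consider the case where an adjective consists of more than one token. My solution handles multi-token candidates too.

Here is the fixed code: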

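    # Reuses tokenizer, model, sentence, torch and F from the question's code
    targets = ["", "yellow", "large", "very large"]
    target_log_P = {t: None for t in targets}

    for target in target_log_P:
        # Substitute the candidate for the mask and encode the full sentence
        input = tokenizer.encode_plus(sentence.replace("[MASK]", target), return_tensors = "pt")
        with torch.no_grad():
            output = model(**input)
        # Sum, over all positions, the log of the softmax score the model
        # gives to the token that is actually present at that position
        target_log_P[target] = sum([
            torch.log(F.softmax(output.logits[0][i], dim=-1)[idx])
            for i, idx in enumerate(input['input_ids'][0])
        ]).item()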

There may be a pipeline for this, and my solution here is not the standard approach, but it seems to work...

The results are as follows:

    >>> target_log_P
    {'': -37.5234375, 'yellow': -37.08171463012695, 'large': -35.85972213745117, 'very large': -46.483154296875}
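One possible follow-up, assuming relative weights are easier to read than raw log scores, is to exponentiate the scores and renormalise them over the candidate set. This is only a sketch: `normalize_log_scores` is a hypothetical helper, and the resulting weights are relative to these four candidates only. Longer candidates sum over more tokens, so the comparison is not length-neutral.

```python
import math

# Log scores copied from the output above.
target_log_P = {
    "": -37.5234375,
    "yellow": -37.08171463012695,
    "large": -35.85972213745117,
    "very large": -46.483154296875,
}


def normalize_log_scores(log_scores):
    """Turn log scores into weights that sum to 1 over the given candidates."""
    peak = max(log_scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - peak) for k, v in log_scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


relative = normalize_log_scores(target_log_P)
best = max(relative, key=relative.get)
print(relative)
print("highest-scoring candidate:", repr(best))
```

With these scores, "large" comes out on top, and the empty string ranks just below "yellow".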
