I want to use NLP to fill in a masked word in a text, but instead of choosing from all possible words, I want to find the more likely of two candidate words. For example, suppose I have the sentence "The [MASK] was stuck in the tree" and I want to evaluate whether "kite" or "bike" is the more likely word.
I know how to find the globally most likely words using Hugging Face's fill-mask pipeline:
from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM
# Define the input sentence with a masked word
input_text = "The [MASK] was stuck in the tree"
# Load the pre-trained model and tokenizer
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
# Tokenize the input sentence
tokenized_text = tokenizer.tokenize(input_text)
# Use the pipeline to generate a list of predicted words and probabilities
mlm = pipeline("fill-mask", model=model, tokenizer=tokenizer)
results = mlm(input_text)
# Print outputs
for result in results:
    token = result["token_str"]
    print(f"{token:<15} {result['score']}")
However, this doesn't help if "bike" and "kite" aren't among the top few most likely words.
How can I use fill-mask to find the probability of specific mask fills?
P.S. I'm not sure whether Stack Overflow is the best place to post this question; there doesn't seem to be anywhere to ask NLP-specific questions.
There doesn't seem to be a direct way to feed two candidate words into fill-mask, but if we dig into the underlying model that fill-mask is built on, we can compute the probability of the other words in the sentence given a candidate word.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from torch.nn.functional import softmax
# Load the pre-trained model and tokenizer
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
# Define the input sentence with a masked word
input_text = "The [MASK] was stuck in the tree."
candidates = ["fancy hot dog", "dog", "cat", "bird", "run"]
# Tokenize the input sentence
tokenized_text = tokenizer.tokenize(input_text)
mask_token_index = tokenized_text.index("[MASK]")
# Function to calculate the probability of a candidate
def get_candidate_probability(candidate_tokens):
    # Replace the masked token with the candidate tokens and wrap the
    # sequence in the [CLS]/[SEP] special tokens that BERT expects
    tokenized_candidate = (
        ["[CLS]"]
        + tokenized_text[:mask_token_index]
        + candidate_tokens
        + tokenized_text[mask_token_index + 1:]
        + ["[SEP]"]
    )
    # Convert the tokenized sentence to input IDs
    input_ids = tokenizer.convert_tokens_to_ids(tokenized_candidate)
    # Convert the input IDs to a batch-of-one tensor
    input_tensor = torch.tensor([input_ids])
    # Get the logits from the model
    with torch.no_grad():
        logits = model(input_tensor).logits[0]
    # Softmax over the vocabulary dimension, then pick out the
    # probability of each token that actually appears in the sentence
    probs = softmax(logits, dim=-1)
    probs = probs[range(len(input_ids)), input_ids]
    # Multiply the probabilities of every token outside the candidate
    prob = (
        torch.prod(probs[1:mask_token_index + 1])
        * torch.prod(probs[mask_token_index + len(candidate_tokens) + 1:])
    )
    return prob.item()
# Evaluate the probability of each candidate word
for candidate in candidates:
    candidate_tokens = tokenizer.tokenize(candidate)
    candidate_probability = get_candidate_probability(candidate_tokens)
    print(f"{candidate:<20} {candidate_probability}")
Here, for each candidate we replace the mask with the candidate's tokens, convert each token to its ID, and feed the IDs into the model. The model outputs a (batch_size, sequence_length, vocab_size) tensor of logits, where the (i, j, k)-th element is the unnormalized log-probability of the k-th vocabulary token at the j-th position of the i-th sentence in the batch. We softmax the logits along the vocabulary dimension and then pick out the probability of each token that actually appears in the sentence. Finally, we multiply together the probabilities of all the tokens outside the candidate, giving the probability of the rest of the sentence under that candidate fill of the mask.
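As a minimal sketch of that indexing step (the logits and vocabulary below are made up for illustration), the line probs[range(len(input_ids)), input_ids] gathers, for each position, the probability of the token that actually occurs there:
import torch
from torch.nn.functional import softmax

# Hypothetical logits for a 2-token input over a 5-word vocabulary
logits = torch.tensor([[2.0, 0.5, 0.1, 0.1, 0.3],
                       [0.2, 1.5, 0.4, 3.0, 0.1]])
input_ids = [0, 3]  # the token IDs that actually appear at positions 0 and 1
probs = softmax(logits, dim=-1)
# Advanced indexing: position j selects probs[j, input_ids[j]]
token_probs = probs[range(len(input_ids)), input_ids]
print(token_probs)  # the probability the model assigned to each observed token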
Since version 3.1.0, the fill-mask pipeline also supports a targets argument that does this directly in the pipeline:
from transformers import pipeline
pipe = pipeline("fill-mask", model="bert-base-cased")
results = pipe("The [MASK] was stuck in the tree", targets=["bike", "kite"])
for result in results:
    print(f'{result["token_str"]}: {result["score"]}')
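Note that each score here is the probability the model assigns to that target over the full vocabulary, so the two scores won't sum to 1; the candidate with the higher score is simply the more likely fill. Also, targets are matched against the model's vocabulary, so a word that tokenizes into multiple subword pieces is reduced to its first piece (the pipeline warns when this happens). If you want a probability over just the two candidates, you can renormalize the scores yourself; a small sketch, reusing the results list from above:
scores = {result["token_str"]: result["score"] for result in results}
total = sum(scores.values())
for token, score in scores.items():
    print(f"{token}: {score / total:.3f}")  # probability relative to the two candidates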