How to use NLP fill-mask with a limited set of possible inputs

Question

I want to use NLP to fill in a masked word in a text, but instead of choosing from all possible words I want to find the more likely of two candidate words. For example, suppose I have the sentence "The [MASK] was stuck in the tree" and I want to evaluate whether "kite" or "bike" is the more likely word.

I know how to find the globally most likely words using Hugging Face's fill-mask pipeline:

from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM

# Define the input sentence with a masked word
input_text = "The [MASK] was stuck in the tree"

# Load the pre-trained model and tokenizer
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)


# Use the pipeline to generate a list of predicted words and probabilities
mlm = pipeline("fill-mask", model=model, tokenizer=tokenizer)
results = mlm(input_text)

# Print outputs
for result in results:
    token = result["token_str"]
    print(f"{token:<15} {result['score']}")

However, this doesn't help if "bike" and "kite" are not among the top few most likely words.

How can I use fill-mask to find the probabilities of specific fills for the mask?

P.S. I'm not sure whether Stack Overflow is the best place to post this question; there doesn't seem to be a site for NLP-specific questions.

nlp mask huggingface
2 Answers
0 votes

There doesn't seem to be a direct way to feed two candidate words to fill-mask, but if we dig into the underlying model that fill-mask is built on, we can compute the probability of the rest of the sentence given each candidate fill.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from torch.nn.functional import softmax

# Load the pre-trained model and tokenizer
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Define the input sentence with a masked word
input_text = "The [MASK] was stuck in the tree."
candidates = ["fancy hot dog", "dog", "cat", "bird", "run"]

# Tokenize the input sentence
tokenized_text = tokenizer.tokenize(input_text)
mask_token_index = tokenized_text.index("[MASK]")

# Function to calculate the probability of a candidate
def get_candidate_probability(candidate_tokens):

    # Replace the masked token with the candidate tokens, adding the
    # [CLS]/[SEP] special tokens that BERT expects around a sequence
    tokenized_candidate = (
        ["[CLS]"]
        + tokenized_text[:mask_token_index]
        + candidate_tokens
        + tokenized_text[mask_token_index + 1:]
        + ["[SEP]"]
    )

    # Convert tokenized sentence to input IDs
    input_ids = tokenizer.convert_tokens_to_ids(tokenized_candidate)

    # Convert input IDs to tensors
    input_tensor = torch.tensor([input_ids])

    # Get the logits from the model
    with torch.no_grad():
        logits = model(input_tensor).logits[0]

    # Softmax over the vocabulary, pick out each input token's own
    # probability, then multiply the probabilities of all tokens outside
    # the candidate (excluding [CLS] and [SEP])
    probs = softmax(logits, dim=-1)
    probs = probs[range(len(input_ids)), input_ids]
    prob = (
        torch.prod(probs[1:mask_token_index + 1])
        * torch.prod(probs[mask_token_index + len(candidate_tokens) + 1:-1])
    )

    return prob.item()

# Evaluate the probability of each candidate word
for candidate in candidates:
    candidate_tokens = tokenizer.tokenize(candidate)
    candidate_probability = get_candidate_probability(candidate_tokens)
    print(f"{candidate:<20} {candidate_probability}")

Here, for each candidate we replace the mask with the candidate's tokens, convert every token to its ID, and feed the IDs into the model. The model outputs a logits tensor of shape (batch_size, sequence_length, vocab_size), where the (i, j, k)-th element is the unnormalized log-probability of the k-th vocabulary token appearing at the j-th position of the i-th sentence in the batch. We softmax the logits along the vocabulary dimension and pick out the probability of each token that actually appears in the sentence. Finally, we multiply together the probabilities of all the tokens outside the candidate, which gives the probability of the rest of the sentence given that candidate fill of the mask.
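Because these raw products are only meaningful relative to one another, one way to compare the two candidates from the question is to normalize their scores; a minimal sketch, reusing the get_candidate_probability function and tokenizer defined above:

# Score the two candidates from the question and normalize the raw
# products into relative probabilities (reuses get_candidate_probability
# and tokenizer from the snippet above)
scores = {
    word: get_candidate_probability(tokenizer.tokenize(word))
    for word in ["kite", "bike"]
}
total = sum(scores.values())
for word, score in scores.items():
    print(f"{word:<10} {score / total:.2%}")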


0 votes

As of version 3.1.0, the fill-mask pipeline also supports a targets argument that does this directly in the pipeline:

from transformers import pipeline

pipe = pipeline("fill-mask", model="bert-base-cased")
results = pipe("The [MASK] was stuck in the tree", targets=["bike", "kite"])

for result in results:
    print(f'{result["token_str"]}: {result["score"]}')
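Note that targets are matched as single vocabulary tokens; if a target tokenizes into multiple sub-word pieces, the pipeline uses only the first piece (with a warning), so multi-token candidates still need the manual scoring approach from the first answer. For single-token candidates, the same scores can also be read directly off the model; a minimal sketch of what the pipeline does under the hood (not part of the original answer):

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Encode the sentence and locate the [MASK] position
inputs = tokenizer("The [MASK] was stuck in the tree", return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]  # logits at the mask position

# Softmax over the vocabulary, then look up each candidate's token ID
probs = logits.softmax(dim=-1)
for word in ["bike", "kite"]:
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    if len(ids) != 1:
        print(f"{word} is not a single token in this vocabulary")
        continue
    print(f"{word}: {probs[0, ids[0]].item():.6f}")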