ValueError: Unable to create tensor for a Transformers model

Question (0 votes, 1 answer)

I am trying to train a Transformers model on audio data, but I keep getting the following error:

"ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected)."

Although I have already enabled truncation and padding in both the DataCollator and the prepare_dataset function, as shown below, the problem persists.

Data collator:

def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
    if not features:
        return {}
    # split inputs and labels since they have to be of different lengths and need
    # different padding methods
    input_features = [{"input_values": feature["input_values"]} for feature in features]
    label_features = [{"input_ids": feature["labels"]} for feature in features]

    batch = self.processor.pad(
        input_features,
        padding=self.padding,
        max_length=self.max_length,
        truncation=True,  # added truncation parameter
        pad_to_multiple_of=self.pad_to_multiple_of,
        return_tensors="pt",
    )
    with self.processor.as_target_processor():
        labels_batch = self.processor.pad(
            label_features,
            padding=self.padding,
            max_length=self.max_length_labels,
            pad_to_multiple_of=self.pad_to_multiple_of_labels,
            return_tensors="pt",
        )

    # replace padding with -100 to ignore loss correctly
    labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

    batch["labels"] = labels

    return batch
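The `masked_fill` step above replaces label tokens at padded positions with -100, the index that PyTorch's cross-entropy loss ignores. A minimal pure-Python sketch of that masking logic (the helper name is mine, not part of the original code):

```python
def mask_padding(labels, attention_mask, ignore_index=-100):
    """Replace label tokens at padded positions (mask == 0) with ignore_index."""
    return [
        [tok if m == 1 else ignore_index for tok, m in zip(row, mask_row)]
        for row, mask_row in zip(labels, attention_mask)
    ]

labels = [[5, 6, 0], [7, 0, 0]]
attention_mask = [[1, 1, 0], [1, 0, 0]]
masked = mask_padding(labels, attention_mask)
# padded positions are now -100 and will not contribute to the loss
```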

The prepare_dataset function:

MAX_TRANSCRIPTION_LENGTH = 128

def prepare_dataset(batch):
    speech, transcription = batch["audio"], batch["text"]

    # speech_tensors = [torch.tensor(s["array"]) for s in speech]
    speech_tensors = []
    for s in speech:
        # MAX_SPEECH_LENGTH must be defined elsewhere in the script
        tensor = torch.tensor(s["array"])[:MAX_SPEECH_LENGTH] if isinstance(s, dict) else torch.zeros((1,))
        speech_tensors.append(tensor)

    inputs = torch.nn.utils.rnn.pad_sequence(speech_tensors, batch_first=True, padding_value=0.0)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(transcription, padding=True, truncation=True, add_special_tokens=True,
                           max_length=MAX_TRANSCRIPTION_LENGTH, return_tensors="pt").input_ids

    with tokenizer.as_target_tokenizer():
        input_ids = tokenizer(transcription, padding=True, truncation=True, add_special_tokens=True,
                              max_length=MAX_TRANSCRIPTION_LENGTH, return_tensors="pt").input_ids

    return {"input_values": inputs, "attention_mask": inputs != 0.0, "input_ids": input_ids, "labels": labels}

Any help resolving this would be appreciated.

python tensor huggingface-transformers torch huggingface-tokenizers
1 Answer

0 votes

padding=True pads only to the longest sequence in each batch, so different batches end up with different lengths. Try padding="max_length" together with an explicit max_length so every batch has the same fixed shape (keep truncation=True; "max_length" is not a valid truncation strategy, and truncation=True already truncates to the given max_length).
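To see why fixed-length padding resolves the error, here is a tokenizer-free sketch: a ragged list of token ids cannot be turned into a tensor, but truncating and padding every row to the same max_length can. The helper function and pad id are illustrative, not part of any library:

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad it out to exactly max_length."""
    batch = []
    for seq in sequences:
        seq = seq[:max_length]                                   # truncation
        batch.append(seq + [pad_id] * (max_length - len(seq)))   # padding
    return batch

token_ids = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
batch = pad_and_truncate(token_ids, max_length=4)
# every row now has length 4, so torch.tensor(batch) would succeed
```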
